4.3.0
Dataset Features
Enable large scale distributed dataset streaming:
- Keep hffs cache in workers when streaming by @lhoestq in https://github.com/huggingface/datasets/pull/7820
- Retry open hf file by @lhoestq in https://github.com/huggingface/datasets/pull/7822
These improvements require huggingface_hub>=1.1.0 to take full effect.
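A minimal sketch of how a script might check the huggingface_hub>=1.1.0 requirement before relying on the streaming improvements. The helper names (`release_tuple`, `meets_minimum`, `hub_supports_streaming_improvements`) are hypothetical and not part of datasets or huggingface_hub. The version parsing is deliberately simplified and ignores pre-release suffixes.

```python
import re
from importlib import metadata

# Minimum huggingface_hub version stated in these release notes.
MIN_HUB_VERSION = "1.1.0"


def release_tuple(version: str) -> tuple:
    """Return the leading numeric components of a version string as ints.

    Simplification (assumption): suffixes such as "rc1" or ".dev0" are ignored.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_HUB_VERSION) -> bool:
    """Compare two version strings after zero-padding to equal length."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def hub_supports_streaming_improvements() -> bool:
    """Hypothetical helper: True if the installed huggingface_hub is new enough."""
    try:
        installed = metadata.version("huggingface_hub")
    except metadata.PackageNotFoundError:
        return False  # huggingface_hub is not installed at all
    return meets_minimum(installed)


if __name__ == "__main__":
    if not hub_supports_streaming_improvements():
        print(f"Consider upgrading: pip install -U 'huggingface_hub>={MIN_HUB_VERSION}'")
```

For strict handling of pre-releases, a dedicated version parser such as the one in the `packaging` library would be a better fit than this hand-rolled comparison.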
What's Changed
- fix conda deps by @lhoestq in https://github.com/huggingface/datasets/pull/7810
- Add pyarrow's binary view to features by @delta003 in https://github.com/huggingface/datasets/pull/7795
- Fix polars cast column image by @CloseChoice in https://github.com/huggingface/datasets/pull/7800
- Allow streaming hdf5 files by @lhoestq in https://github.com/huggingface/datasets/pull/7814
- Fix batch_size default description in to_polars docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/7824
- docs: document_dataset PDFs & OCR by @ethanknights in https://github.com/huggingface/datasets/pull/7812
- Add custom fingerprint support to from_generator by @simonreise in https://github.com/huggingface/datasets/pull/7533
- picklable batch_fn by @lhoestq in https://github.com/huggingface/datasets/pull/7826
New Contributors
- @delta003 made their first contribution in https://github.com/huggingface/datasets/pull/7795
- @CloseChoice made their first contribution in https://github.com/huggingface/datasets/pull/7800
- @ethanknights made their first contribution in https://github.com/huggingface/datasets/pull/7812
- @simonreise made their first contribution in https://github.com/huggingface/datasets/pull/7533
Full Changelog: https://github.com/huggingface/datasets/compare/4.2.0...4.3.0