datasets 3.2.0
HuggingFace community-driven open-source library of datasets
pip install datasets
Requires Python
>=3.9.0
Dependencies
- filelock
- numpy>=1.17
- pyarrow>=15.0.0
- dill<0.3.9,>=0.3.0
- pandas
- requests>=2.32.2
- tqdm>=4.66.3
- xxhash
- multiprocess<0.70.17
- fsspec[http]<=2024.9.0,>=2023.1.0
- aiohttp
- huggingface-hub>=0.23.0
- packaging
- pyyaml>=5.1
- soundfile>=0.12.1; extra == "audio"
- librosa; extra == "audio"
- soxr>=0.4.0; python_version >= "3.9" and extra == "audio"
- tensorflow==2.12.0; extra == "benchmarks"
- torch==2.0.1; extra == "benchmarks"
- transformers==4.30.1; extra == "benchmarks"
- absl-py; extra == "dev"
- decorator; extra == "dev"
- joblib<1.3.0; extra == "dev"
- joblibspark; extra == "dev"
- pytest; extra == "dev"
- pytest-datadir; extra == "dev"
- pytest-xdist; extra == "dev"
- elasticsearch<8.0.0,>=7.17.12; extra == "dev"
- faiss-cpu>=1.8.0.post1; extra == "dev"
- lz4; extra == "dev"
- moto[server]; extra == "dev"
- pyspark>=3.4; extra == "dev"
- py7zr; extra == "dev"
- rarfile>=4.0; extra == "dev"
- sqlalchemy; extra == "dev"
- s3fs>=2021.11.1; extra == "dev"
- protobuf<4.0.0; extra == "dev"
- tiktoken; extra == "dev"
- torch>=2.0.0; extra == "dev"
- torchdata; extra == "dev"
- soundfile>=0.12.1; extra == "dev"
- transformers>=4.42.0; extra == "dev"
- zstandard; extra == "dev"
- polars[timezone]>=0.20.0; extra == "dev"
- decord==0.6.0; extra == "dev"
- Pillow>=9.4.0; extra == "dev"
- librosa; extra == "dev"
- ruff>=0.3.0; extra == "dev"
- s3fs; extra == "dev"
- transformers; extra == "dev"
- torch; extra == "dev"
- tensorflow>=2.6.0; extra == "dev"
- tensorflow>=2.6.0; python_version < "3.10" and extra == "dev"
- tensorflow>=2.16.0; python_version >= "3.10" and extra == "dev"
- soxr>=0.4.0; python_version >= "3.9" and extra == "dev"
- jax>=0.3.14; sys_platform != "win32" and extra == "dev"
- jaxlib>=0.3.14; sys_platform != "win32" and extra == "dev"
- s3fs; extra == "docs"
- transformers; extra == "docs"
- torch; extra == "docs"
- tensorflow>=2.6.0; extra == "docs"
- jax>=0.3.14; extra == "jax"
- jaxlib>=0.3.14; extra == "jax"
- ruff>=0.3.0; extra == "quality"
- s3fs; extra == "s3"
- tensorflow>=2.6.0; extra == "tensorflow"
- tensorflow>=2.6.0; extra == "tensorflow-gpu"
- absl-py; extra == "tests"
- decorator; extra == "tests"
- joblib<1.3.0; extra == "tests"
- joblibspark; extra == "tests"
- pytest; extra == "tests"
- pytest-datadir; extra == "tests"
- pytest-xdist; extra == "tests"
- elasticsearch<8.0.0,>=7.17.12; extra == "tests"
- faiss-cpu>=1.8.0.post1; extra == "tests"
- lz4; extra == "tests"
- moto[server]; extra == "tests"
- pyspark>=3.4; extra == "tests"
- py7zr; extra == "tests"
- rarfile>=4.0; extra == "tests"
- sqlalchemy; extra == "tests"
- s3fs>=2021.11.1; extra == "tests"
- protobuf<4.0.0; extra == "tests"
- tiktoken; extra == "tests"
- torch>=2.0.0; extra == "tests"
- torchdata; extra == "tests"
- soundfile>=0.12.1; extra == "tests"
- transformers>=4.42.0; extra == "tests"
- zstandard; extra == "tests"
- polars[timezone]>=0.20.0; extra == "tests"
- decord==0.6.0; extra == "tests"
- Pillow>=9.4.0; extra == "tests"
- librosa; extra == "tests"
- tensorflow>=2.6.0; python_version < "3.10" and extra == "tests"
- tensorflow>=2.16.0; python_version >= "3.10" and extra == "tests"
- soxr>=0.4.0; python_version >= "3.9" and extra == "tests"
- jax>=0.3.14; sys_platform != "win32" and extra == "tests"
- jaxlib>=0.3.14; sys_platform != "win32" and extra == "tests"
- absl-py; extra == "tests-numpy2"
- decorator; extra == "tests-numpy2"
- joblib<1.3.0; extra == "tests-numpy2"
- joblibspark; extra == "tests-numpy2"
- pytest; extra == "tests-numpy2"
- pytest-datadir; extra == "tests-numpy2"
- pytest-xdist; extra == "tests-numpy2"
- elasticsearch<8.0.0,>=7.17.12; extra == "tests-numpy2"
- lz4; extra == "tests-numpy2"
- moto[server]; extra == "tests-numpy2"
- pyspark>=3.4; extra == "tests-numpy2"
- py7zr; extra == "tests-numpy2"
- rarfile>=4.0; extra == "tests-numpy2"
- sqlalchemy; extra == "tests-numpy2"
- s3fs>=2021.11.1; extra == "tests-numpy2"
- protobuf<4.0.0; extra == "tests-numpy2"
- tiktoken; extra == "tests-numpy2"
- torch>=2.0.0; extra == "tests-numpy2"
- torchdata; extra == "tests-numpy2"
- soundfile>=0.12.1; extra == "tests-numpy2"
- transformers>=4.42.0; extra == "tests-numpy2"
- zstandard; extra == "tests-numpy2"
- polars[timezone]>=0.20.0; extra == "tests-numpy2"
- decord==0.6.0; extra == "tests-numpy2"
- Pillow>=9.4.0; extra == "tests-numpy2"
- soxr>=0.4.0; python_version >= "3.9" and extra == "tests_numpy2"
- jax>=0.3.14; sys_platform != "win32" and extra == "tests_numpy2"
- jaxlib>=0.3.14; sys_platform != "win32" and extra == "tests_numpy2"
- torch; extra == "torch"
- Pillow>=9.4.0; extra == "vision"
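Entries that carry an `extra == "..."` marker are only installed when that optional extra is requested. As a rough illustration of how the markers group the list, the sketch below filters requirement strings by extra name. The helper `requirements_for_extra` and the `SAMPLE_REQUIREMENTS` list are hypothetical names introduced here; the sample strings are copied from the dependency list above, and the matching is a simplified text search rather than a full marker evaluation (it ignores conditions such as `python_version`).

```python
import re


def requirements_for_extra(requirements, extra):
    """Return the requirement specs whose marker mentions the given extra.

    Simplified on purpose: it only looks for `extra == "<name>"` in the
    marker text and does not evaluate other marker conditions.
    """
    pattern = re.compile(rf'extra\s*==\s*["\']{re.escape(extra)}["\']')
    selected = []
    for requirement in requirements:
        # Everything after the first ";" is the environment marker.
        spec, _, marker = requirement.partition(";")
        if pattern.search(marker):
            selected.append(spec.strip())
    return selected


# A few entries copied from the dependency list (illustrative subset).
SAMPLE_REQUIREMENTS = [
    "numpy>=1.17",
    'soundfile>=0.12.1; extra == "audio"',
    'librosa; extra == "audio"',
    'soxr>=0.4.0; python_version >= "3.9" and extra == "audio"',
    'Pillow>=9.4.0; extra == "vision"',
]

print(requirements_for_extra(SAMPLE_REQUIREMENTS, "audio"))
# -> ['soundfile>=0.12.1', 'librosa', 'soxr>=0.4.0']
```

With the package installed, the standard-library call `importlib.metadata.requires("datasets")` should return requirement strings in this same form, which could be passed to the helper instead of the sample list.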
🤗 Datasets is a lightweight library providing two main features:
- one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the HuggingFace Datasets Hub. With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training or evaluating an ML model (NumPy/Pandas/PyTorch/TensorFlow/JAX).
- efficient data pre-processing: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
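The two commands quoted above can be combined into a small end-to-end sketch. The README does not define `process_example`, so the version below is a hypothetical transform that adds the length of each question; `load_and_process` and its default arguments are also illustrative names, and the `train[:100]` slice is only there to keep the download and processing small.

```python
def process_example(example):
    """Hypothetical per-example transform: record the question length.

    `example` is one row of the dataset as a dict; the returned dict is
    merged into the row, so this adds a "question_length" column.
    """
    return {"question_length": len(example["question"])}


def load_and_process(name="squad", split="train[:100]"):
    """Load a slice of a Hub dataset and apply process_example to each row."""
    # Imported lazily so process_example can be tried without the library.
    from datasets import load_dataset

    dataset = load_dataset(name, split=split)  # fetched from the Hub on first use
    return dataset.map(process_example)
```

Calling `load_and_process()` requires the `datasets` package and network access on first use; afterwards, `processed_dataset[0]["question_length"]` would hold the value computed for the first row.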
Documentation | Find a dataset in the Hub | Share a dataset on the Hub