kedro-datasets 5.1.0
Kedro-Datasets is where you can find all of Kedro's data connectors.
pip install kedro-datasets
Requires Python: >=3.10
Dependencies
- kedro>=0.19.7
- lazy-loader
- kedro-datasets[docs,test]; extra == "all"
- kedro-datasets[api-apidataset]; extra == "api"
- requests~=2.20; extra == "api-apidataset"
- kedro-datasets[biosequence-biosequencedataset]; extra == "biosequence"
- biopython~=1.73; extra == "biosequence-biosequencedataset"
- kedro-datasets[dask-csvdataset,dask-parquetdataset]; extra == "dask"
- dask[dataframe]>=2021.10; extra == "dask-csvdataset"
- dask[complete]>=2021.10; extra == "dask-parquetdataset"
- triad<1.0,>=0.6.7; extra == "dask-parquetdataset"
- kedro-datasets[databricks-managedtabledataset]; extra == "databricks"
- kedro-datasets[delta-base,hdfs-base,pandas-base,s3fs-base,spark-base]; extra == "databricks-managedtabledataset"
- delta-spark<4.0,>=1.0; extra == "delta-base"
- kedro-sphinx-theme==2024.10.2; extra == "docs"
- ipykernel<7.0,>=5.3; extra == "docs"
- Jinja2<3.2.0; extra == "docs"
- langchain-openai; extra == "experimental"
- langchain-cohere; extra == "experimental"
- langchain-anthropic; extra == "experimental"
- langchain-community; extra == "experimental"
- h5netcdf>=1.2.0; extra == "experimental"
- netcdf4>=1.6.4; extra == "experimental"
- xarray>=2023.1.0; extra == "experimental"
- rioxarray; extra == "experimental"
- torch; extra == "experimental"
- prophet>=1.1.5; extra == "experimental"
- kedro-datasets[geopandas-genericdataset]; extra == "geopandas"
- geopandas<2.0,>=0.8.0; extra == "geopandas-genericdataset"
- fiona<2.0,>=1.8; extra == "geopandas-genericdataset"
- hdfs<3.0,>=2.5.8; extra == "hdfs-base"
- kedro-datasets[holoviews-holoviewswriter]; extra == "holoviews"
- holoviews>=1.13.0; extra == "holoviews-holoviewswriter"
- kedro-datasets[huggingface-hfdataset,huggingface-hftransformerpipelinedataset]; extra == "huggingface"
- datasets; extra == "huggingface-hfdataset"
- huggingface-hub; extra == "huggingface-hfdataset"
- transformers; extra == "huggingface-hftransformerpipelinedataset"
- ibis-framework; extra == "ibis"
- ibis-framework[bigquery]; extra == "ibis-bigquery"
- ibis-framework[clickhouse]; extra == "ibis-clickhouse"
- ibis-framework[dask]; extra == "ibis-dask"
- ibis-framework[datafusion]; extra == "ibis-datafusion"
- ibis-framework[druid]; extra == "ibis-druid"
- ibis-framework[duckdb]; extra == "ibis-duckdb"
- ibis-framework[exasol]; extra == "ibis-exasol"
- ibis-framework; extra == "ibis-flink"
- apache-flink; extra == "ibis-flink"
- ibis-framework[impala]; extra == "ibis-impala"
- ibis-framework[mssql]; extra == "ibis-mssql"
- ibis-framework[mysql]; extra == "ibis-mysql"
- ibis-framework[oracle]; extra == "ibis-oracle"
- ibis-framework[pandas]; extra == "ibis-pandas"
- ibis-framework[polars]; extra == "ibis-polars"
- ibis-framework[postgres]; extra == "ibis-postgres"
- ibis-framework[pyspark]; extra == "ibis-pyspark"
- ibis-framework[risingwave]; extra == "ibis-risingwave"
- ibis-framework[snowflake]; extra == "ibis-snowflake"
- ibis-framework[sqlite]; extra == "ibis-sqlite"
- ibis-framework[trino]; extra == "ibis-trino"
- kedro-datasets[json-jsondataset]; extra == "json"
- kedro-datasets[langchain-chatanthropicdataset,langchain-chatcoheredataset,langchain-chatopenaidataset,langchain-openaiembeddingsdataset]; extra == "langchain"
- langchain-anthropic~=0.1.13; extra == "langchain-chatanthropicdataset"
- langchain-community~=0.2.0; extra == "langchain-chatanthropicdataset"
- langchain-cohere~=0.1.5; extra == "langchain-chatcoheredataset"
- langchain-community~=0.2.0; extra == "langchain-chatcoheredataset"
- langchain-openai~=0.1.7; extra == "langchain-chatopenaidataset"
- langchain-openai~=0.1.7; extra == "langchain-openaiembeddingsdataset"
- kedro-datasets[matlab-matlabdataset]; extra == "matlab"
- scipy; extra == "matlab-matlabdataset"
- kedro-datasets[matplotlib-matplotlibwriter]; extra == "matplotlib"
- matplotlib<4.0,>=3.0.3; extra == "matplotlib-matplotlibwriter"
- kedro-datasets[netcdf-netcdfdataset]; extra == "netcdf"
- h5netcdf>=1.2.0; extra == "netcdf-netcdfdataset"
- netcdf4>=1.6.4; extra == "netcdf-netcdfdataset"
- xarray>=2023.1.0; extra == "netcdf-netcdfdataset"
- kedro-datasets[networkx-base]; extra == "networkx"
- networkx~=2.4; extra == "networkx-base"
- kedro-datasets[networkx-base]; extra == "networkx-gmldataset"
- kedro-datasets[networkx-base]; extra == "networkx-graphmldataset"
- kedro-datasets[networkx-base]; extra == "networkx-jsondataset"
- kedro-datasets[pandas-csvdataset,pandas-deltatabledataset,pandas-exceldataset,pandas-featherdataset,pandas-gbqquerydataset,pandas-gbqtabledataset,pandas-genericdataset,pandas-hdfdataset,pandas-jsondataset,pandas-parquetdataset,pandas-sqlquerydataset,pandas-sqltabledataset,pandas-xmldataset]; extra == "pandas"
- pandas<3.0,>=1.3; extra == "pandas-base"
- kedro-datasets[pandas-base]; extra == "pandas-csvdataset"
- kedro-datasets[pandas-base]; extra == "pandas-deltatabledataset"
- deltalake>=0.10.0; extra == "pandas-deltatabledataset"
- kedro-datasets[pandas-base]; extra == "pandas-exceldataset"
- openpyxl<4.0,>=3.0.6; extra == "pandas-exceldataset"
- kedro-datasets[pandas-base]; extra == "pandas-featherdataset"
- kedro-datasets[pandas-base]; extra == "pandas-gbqquerydataset"
- pandas-gbq>=0.12.0; extra == "pandas-gbqquerydataset"
- kedro-datasets[pandas-base]; extra == "pandas-gbqtabledataset"
- pandas-gbq>=0.12.0; extra == "pandas-gbqtabledataset"
- kedro-datasets[pandas-base]; extra == "pandas-genericdataset"
- kedro-datasets[pandas-base]; extra == "pandas-hdfdataset"
- tables>=3.6; extra == "pandas-hdfdataset"
- kedro-datasets[pandas-base]; extra == "pandas-jsondataset"
- kedro-datasets[pandas-base]; extra == "pandas-parquetdataset"
- pyarrow>=6.0; extra == "pandas-parquetdataset"
- kedro-datasets[pandas-base]; extra == "pandas-sqlquerydataset"
- SQLAlchemy<3.0,>=1.4; extra == "pandas-sqlquerydataset"
- pyodbc>=4.0; extra == "pandas-sqlquerydataset"
- kedro-datasets[pandas-base]; extra == "pandas-sqltabledataset"
- SQLAlchemy<3.0,>=1.4; extra == "pandas-sqltabledataset"
- kedro-datasets[pandas-base]; extra == "pandas-xmldataset"
- lxml~=4.6; extra == "pandas-xmldataset"
- kedro-datasets[pickle-pickledataset]; extra == "pickle"
- compress-pickle[lz4]~=2.1.0; extra == "pickle-pickledataset"
- kedro-datasets[pillow-imagedataset]; extra == "pillow"
- Pillow>=9.0; extra == "pillow-imagedataset"
- kedro-datasets[plotly-htmldataset,plotly-jsondataset,plotly-plotlydataset]; extra == "plotly"
- plotly<6.0,>=4.8.0; extra == "plotly-base"
- kedro-datasets[plotly-base]; extra == "plotly-htmldataset"
- kedro-datasets[plotly-base]; extra == "plotly-jsondataset"
- kedro-datasets[pandas-base,plotly-base]; extra == "plotly-plotlydataset"
- kedro-datasets[polars-csvdataset,polars-eagerpolarsdataset,polars-lazypolarsdataset]; extra == "polars"
- polars>=0.18.0; extra == "polars-base"
- kedro-datasets[polars-base]; extra == "polars-csvdataset"
- kedro-datasets[polars-base]; extra == "polars-eagerpolarsdataset"
- pyarrow>=4.0; extra == "polars-eagerpolarsdataset"
- xlsx2csv>=0.8.0; extra == "polars-eagerpolarsdataset"
- deltalake>=0.6.2; extra == "polars-eagerpolarsdataset"
- kedro-datasets[polars-base]; extra == "polars-lazypolarsdataset"
- pyarrow>=4.0; extra == "polars-lazypolarsdataset"
- deltalake>=0.6.2; extra == "polars-lazypolarsdataset"
- kedro-datasets[prophet]; extra == "prophet"
- prophet>=1.1.5; extra == "prophet-dataset"
- kedro-datasets[pytorch-dataset]; extra == "pytorch"
- torch; extra == "pytorch-dataset"
- kedro-datasets[redis-pickledataset]; extra == "redis"
- redis~=4.1; extra == "redis-pickledataset"
- kedro-datasets[rioxarray-geotiffdataset]; extra == "rioxarray"
- rioxarray>=0.15.0; extra == "rioxarray-geotiffdataset"
- s3fs>=2021.4; extra == "s3fs-base"
- kedro-datasets[snowflake-snowparktabledataset]; extra == "snowflake"
- snowflake-snowpark-python~=1.0; extra == "snowflake-snowparktabledataset"
- kedro-datasets[spark-deltatabledataset,spark-sparkdataset,spark-sparkhivedataset,spark-sparkjdbcdataset,spark-sparkstreamingdataset]; extra == "spark"
- pyspark<4.0,>=2.2; extra == "spark-base"
- kedro-datasets[delta-base,hdfs-base,s3fs-base,spark-base]; extra == "spark-deltatabledataset"
- kedro-datasets[hdfs-base,s3fs-base,spark-base]; extra == "spark-sparkdataset"
- kedro-datasets[hdfs-base,s3fs-base,spark-base]; extra == "spark-sparkhivedataset"
- kedro-datasets[spark-base]; extra == "spark-sparkjdbcdataset"
- kedro-datasets[hdfs-base,s3fs-base,spark-base]; extra == "spark-sparkstreamingdataset"
- kedro-datasets[svmlight-svmlightdataset]; extra == "svmlight"
- scikit-learn>=1.0.2; extra == "svmlight-svmlightdataset"
- scipy>=1.7.3; extra == "svmlight-svmlightdataset"
- kedro-datasets[tensorflow-tensorflowmodeldataset]; extra == "tensorflow"
- tensorflow~=2.0; (platform_system != "Darwin" or platform_machine != "arm64") and extra == "tensorflow-tensorflowmodeldataset"
- tensorflow-macos~=2.0; (platform_system == "Darwin" and platform_machine == "arm64") and extra == "tensorflow-tensorflowmodeldataset"
- accelerate<0.32; extra == "test"
- adlfs~=2023.1; extra == "test"
- bandit<2.0,>=1.6.2; extra == "test"
- behave==1.2.6; extra == "test"
- biopython~=1.73; extra == "test"
- blacken-docs==1.9.2; extra == "test"
- black~=22.0; extra == "test"
- cloudpickle<=2.0.0; extra == "test"
- compress-pickle[lz4]~=2.1.0; extra == "test"
- coverage>=7.2.0; extra == "test"
- dask[complete]>=2021.10; extra == "test"
- delta-spark<3.0,>=1.0; extra == "test"
- deltalake>=0.10.0; extra == "test"
- dill~=0.3.1; extra == "test"
- filelock<4.0,>=3.4.0; extra == "test"
- fiona<2.0,>=1.8; extra == "test"
- gcsfs<2023.3,>=2023.1; extra == "test"
- geopandas<2.0,>=0.8.0; extra == "test"
- hdfs<3.0,>=2.5.8; extra == "test"
- holoviews>=1.13.0; extra == "test"
- ibis-framework[duckdb,examples]; extra == "test"
- import-linter[toml]==1.2.6; extra == "test"
- ipython<8.0,>=7.31.1; extra == "test"
- Jinja2<3.2.0; extra == "test"
- joblib>=0.14; extra == "test"
- jupyterlab>=3.0; extra == "test"
- jupyter~=1.0; extra == "test"
- lxml~=4.6; extra == "test"
- matplotlib<3.6,>=3.5; extra == "test"
- memory-profiler<1.0,>=0.50.0; extra == "test"
- moto==5.0.0; extra == "test"
- mypy~=1.0; extra == "test"
- networkx~=2.4; extra == "test"
- opencv-python~=4.5.5.64; extra == "test"
- openpyxl<4.0,>=3.0.3; extra == "test"
- pandas-gbq>=0.12.0; extra == "test"
- pandas>=2.0; extra == "test"
- Pillow~=10.0; extra == "test"
- plotly<6.0,>=4.8.0; extra == "test"
- polars[deltalake,xlsx2csv]~=0.18.0; extra == "test"
- pre-commit>=2.9.2; extra == "test"
- pyodbc~=5.0; extra == "test"
- pytest-cov~=3.0; extra == "test"
- pytest-mock<2.0,>=1.7.1; extra == "test"
- pytest-xdist[psutil]~=2.2.1; extra == "test"
- pytest~=7.2; extra == "test"
- redis~=4.1; extra == "test"
- requests-mock~=1.6; extra == "test"
- requests~=2.20; extra == "test"
- ruff~=0.0.290; extra == "test"
- s3fs>=2021.04; extra == "test"
- scikit-learn<2,>=1.0.2; extra == "test"
- scipy>=1.7.3; extra == "test"
- packaging; extra == "test"
- SQLAlchemy>=1.2; extra == "test"
- tables>=3.6; extra == "test"
- triad<1.0,>=0.6.7; extra == "test"
- trufflehog~=2.1; extra == "test"
- xarray>=2023.1.0; extra == "test"
- xlsxwriter~=1.0; extra == "test"
- datasets; extra == "test"
- huggingface-hub; extra == "test"
- transformers[torch]; extra == "test"
- types-cachetools; extra == "test"
- types-PyYAML; extra == "test"
- types-redis; extra == "test"
- types-requests; extra == "test"
- types-decorator; extra == "test"
- types-six; extra == "test"
- types-tabulate; extra == "test"
- tensorflow~=2.0; (platform_system != "Darwin" or platform_machine != "arm64") and extra == "test"
- tensorflow-macos~=2.0; (platform_system == "Darwin" and platform_machine == "arm64") and extra == "test"
- pyarrow>=1.0; python_version < "3.11" and extra == "test"
- pyspark>=3.0; python_version < "3.11" and extra == "test"
- snowflake-snowpark-python~=1.0; python_version < "3.11" and extra == "test"
- pyarrow>=7.0; python_version >= "3.11" and extra == "test"
- pyspark>=3.4; python_version >= "3.11" and extra == "test"
- kedro-datasets[text-textdataset]; extra == "text"
- kedro-datasets[tracking-jsondataset,tracking-metricsdataset]; extra == "tracking"
- kedro-datasets[video-videodataset]; extra == "video"
- opencv-python~=4.5.5.64; extra == "video-videodataset"
- kedro-datasets[yaml-yamldataset]; extra == "yaml"
- kedro-datasets[pandas-base]; extra == "yaml-yamldataset"
- PyYAML<7.0,>=4.2; extra == "yaml-yamldataset"
Kedro-Datasets
Welcome to `kedro_datasets`, the home of Kedro's data connectors. Here you will find `AbstractDataset` implementations powering Kedro's `DataCatalog`, created by QuantumBlack and external contributors.
Installation
`kedro-datasets` is a Python plugin. To install it:
pip install kedro-datasets
Install dependencies at a group level
Datasets are organised into groups, e.g. `pandas`, `spark` and `pickle`. Each group contains a collection of datasets, e.g. `pandas.CSVDataset`, `pandas.ParquetDataset` and more. You can install the dependencies for an entire group of datasets as follows:
pip install "kedro-datasets[<group>]"
This installs Kedro-Datasets together with the dependencies for that dataset group. For example, if your workflow depends on the data types in `pandas`, run pip install "kedro-datasets[pandas]" to install Kedro-Datasets and the dependencies for all datasets in the `pandas` group.
Install dependencies at a type level
To limit installation to the dependencies specific to a single dataset:
pip install "kedro-datasets[<group>-<dataset>]"
For example, your workflow might require `pandas.ExcelDataset`; to install its dependencies, run pip install "kedro-datasets[pandas-exceldataset]".
From `kedro-datasets` version 3.0.0 onwards, the names of the optional dataset-level dependencies have been normalised to follow [PEP 685](https://peps.python.org/pep-0685/). The '.' character has been replaced with a '-' character and the names are in lowercase. For example, if you had `kedro-datasets[pandas.ExcelDataset]` in your requirements file, it would have to be changed to `kedro-datasets[pandas-exceldataset]`.
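The PEP 685 renaming follows the standard extra-name normalisation rule, which can be sketched in a few lines of Python (a hypothetical helper for illustration, not part of `kedro-datasets`):

```python
import re

def normalise_extra(name: str) -> str:
    """Normalise an extra's name per PEP 685: lowercase it and
    collapse every run of '-', '_' or '.' into a single '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()

print(normalise_extra("pandas.ExcelDataset"))  # pandas-exceldataset
```

Modern pip applies this normalisation itself, so the old spellings still resolve, but requirements files should use the normalised form.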
What `AbstractDataset` implementations are supported?
We support a range of data connectors, including CSV, Excel, Parquet, Feather, HDF5, JSON, Pickle, SQL tables, SQL queries, Spark DataFrames and more. We even support working with images.
These data connectors are supported through the APIs of `pandas`, `spark`, `networkx`, `matplotlib`, `yaml` and more.
The Data Catalog allows you to work with a range of file formats on local file systems, network file systems, cloud object stores, and Hadoop.
Here is a full list of supported data connectors and APIs.
How can I create my own `AbstractDataset` implementation?
Take a look at our instructions on how to create your own `AbstractDataset` implementation.
Can I contribute?
Yes! Want to help build Kedro-Datasets? Check out our guide to contributing.
What licence do you use?
Kedro-Datasets is licensed under the Apache 2.0 License.
Python version support policy
- The Kedro-Datasets package follows the NEP 29 Python version support policy.