kerchunk 0.2.9
Functions to make reference descriptions for ReferenceFileSystem
pip install kerchunk
Requires Python >=3.11
Dependencies
- fsspec >=2025.2.0
- numcodecs
- numpy
- ujson
- zarr >=3.0.1
- cftime ; extra == "cftime"
- xarray ; extra == "fits"
- h5py ; extra == "hdf"
- xarray ; extra == "hdf"
- cfgrib ; extra == "grib2"
- scipy ; extra == "netcdf3"
- cftime ; extra == "dev"
- dask ; extra == "dev"
- fastparquet >=2024.11.0 ; extra == "dev"
- h5netcdf ; extra == "dev"
- h5py ; extra == "dev"
- jinja2 ; extra == "dev"
- mypy ; extra == "dev"
- pytest ; extra == "dev"
- s3fs ; extra == "dev"
- gcsfs ; extra == "dev"
- types-ujson ; extra == "dev"
- xarray >=2024.10.0 ; extra == "dev"
- cfgrib ; extra == "dev"
- scipy ; extra == "dev"
- netcdf4 ; extra == "dev"
- pytest-subtests ; extra == "dev"
kerchunk
Cloud-friendly access to archival data
Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient access to the data from traditional file systems or cloud object storage. It also provides a flexible way to create virtual datasets from multiple files. It does this by extracting the byte ranges, compression information and other metadata from the source data and storing them in a new, separate object.

This means that you can create a virtual aggregate dataset over potentially many source files, for efficient, parallel and cloud-friendly in-situ access, without having to copy or translate the originals. It is a gateway to massive in-the-cloud data processing while data providers still insist on using legacy formats for archival storage.
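For example, scanning a single NetCDF4/HDF5 file yields a reference set that can be saved as JSON and later interpreted by fsspec's ReferenceFileSystem. The sketch below assumes a NetCDF4 file on S3; the bucket, path and keyword options are hypothetical and may differ between versions.

```python
# Minimal sketch: scan one NetCDF4/HDF5 file and write its byte-range
# references to a JSON file. The S3 URL below is hypothetical.
import fsspec
import ujson
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/data/file1.nc"  # hypothetical source file

with fsspec.open(url, "rb", anon=True) as f:
    # Extract chunk byte ranges, compression information and metadata
    refs = SingleHdf5ToZarr(f, url).translate()

with open("file1.json", "w") as out:
    out.write(ujson.dumps(refs))
```

The resulting JSON is the "reference description": a small metadata object that points back into the original file, which is never copied or rewritten.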
Why Kerchunk:
We provide the following things:
- completely serverless architecture
- metadata consolidation, so you can understand a many-file dataset (metadata plus physical storage) in a single read
- read from all of the storage backends supported by fsspec, including object storage (s3, gcs, abfs, alibaba), http, cloud user storage (dropbox, gdrive) and network protocols (ftp, ssh, hdfs, smb...)
- loading of various file types (currently netcdf4/HDF, grib2, tiff, fits, zarr), potentially heterogeneous within a single dataset, without a need to go via the specific driver (e.g., no need for h5py)
- asynchronous concurrent fetch of many data chunks in one go, amortizing the cost of latency
- parallel access with a library like zarr without any locks
- logical datasets viewing many (potentially millions of) data files, with direct access/subselection via coordinate indexing across an arbitrary number of dimensions (see the sketch after this list)
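As a sketch of the many-file case: per-file reference sets (like the one generated above) can be combined along a shared coordinate with kerchunk.combine.MultiZarrToZarr, and the result opened as a single lazy xarray dataset through the "reference://" protocol. The file names, the "time" concatenation dimension and the storage options here are assumptions for illustration, and exact options may vary between versions.

```python
# Minimal sketch: combine per-file references along "time" and open the
# virtual aggregate dataset. Paths and options are hypothetical.
import xarray as xr
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    ["file1.json", "file2.json"],   # per-file reference sets from the step above
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],
)
combined = mzz.translate()

# fsspec's ReferenceFileSystem presents the combined references as a zarr store
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": combined,
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```

Only the small reference object is read up front; the actual chunk bytes are fetched on demand, and concurrently, from the original files.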
For further information, please see the documentation.