mlcroissant 1.0.12
MLCommons datasets format.
pip install mlcroissant
Dependencies
- absl-py
- etils[epath]>=1.7.0
- jsonpath-rw
- networkx
- pandas
- pandas-stubs
- python-dateutil
- rdflib
- requests
- tqdm
- black==23.11.0; extra == "dev"
- flake8-docstrings; extra == "dev"
- mlcroissant[audio]; extra == "dev"
- mlcroissant[beam]; extra == "dev"
- mlcroissant[git]; extra == "dev"
- mlcroissant[image]; extra == "dev"
- mlcroissant[parquet]; extra == "dev"
- mypy; extra == "dev"
- pyflakes; extra == "dev"
- pygraphviz; extra == "dev"
- pytest; extra == "dev"
- pytype; extra == "dev"
- torchdata; extra == "dev"
- librosa; extra == "audio"
- soxr==0.4.0b1; extra == "audio"
- apache-beam; extra == "beam"
- GitPython; extra == "git"
- Pillow; extra == "image"
- pyarrow; extra == "parquet"
mlcroissant 🥐
Discover mlcroissant 🥐 with this introduction tutorial in Google Colab.
Python requirements
Python version >= 3.10.
If you do not have a Python environment:
python3 -m venv ~/py3
source ~/py3/bin/activate
Install
python -m pip install ".[dev]"
The command can fail, for example because of missing dependencies:
Failed to build pygraphviz
ERROR: Could not build wheels for pygraphviz, which is required to install pyproject.toml-based projects
This can be fixed by running:
sudo apt-get install python3-dev graphviz libgraphviz-dev pkg-config
Conda installation
Conda can help create a consistent environment. It can also be useful to install packages without root access. To use Conda, run:
conda create --name croissant python=3.10 -y
conda activate croissant
conda install graphviz
python3 -m pip install ".[dev]"
Verify/load a Croissant dataset
mlcroissant validate --jsonld ../../datasets/titanic/metadata.json
The command:
- Exits with 0, prints Done and displays encountered warnings, when no error was found in the file.
- Exits with 1 and displays all encountered errors/warnings, otherwise.
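Because the result is reported through the exit code, the check is easy to script. The following is a minimal sketch, assuming only the exit-code convention described above; `run_validation` is a hypothetical helper name, not part of mlcroissant.

```python
import subprocess
import sys


def run_validation(command: list[str]) -> bool:
    """Run a validation command and return True when it exits with 0.

    Hypothetical helper: it relies only on the documented convention that
    exit code 0 means "no error" and exit code 1 means "errors were found".
    """
    result = subprocess.run(command, capture_output=True, text=True)
    # Warnings can be printed even on success, so always surface the output.
    if result.stdout:
        print(result.stdout, end="")
    if result.stderr:
        print(result.stderr, end="", file=sys.stderr)
    return result.returncode == 0


# Example call (requires mlcroissant to be installed and on the PATH):
# run_validation(
#     ["mlcroissant", "validate", "--jsonld", "../../datasets/titanic/metadata.json"]
# )
```

A wrapper like this can gate a CI step or a pre-commit hook on the validity of a metadata file.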
Similarly, you can generate a dataset by launching:
mlcroissant load \
--jsonld ../../datasets/titanic/metadata.json \
--record_set passengers \
--num_records 10
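The same load can be sketched from Python with `mlcroissant.Dataset`, the public entry point mentioned in the Design section. This is a minimal sketch, assuming the same relative path and record set name as the command above; `first_records` is a hypothetical helper name.

```python
import itertools


def first_records(jsonld: str, record_set: str, num_records: int) -> list:
    """Return the first `num_records` records of `record_set`.

    Hypothetical helper that mirrors the `mlcroissant load` command.
    """
    # Imported inside the function so that this sketch can be imported
    # even in an environment where mlcroissant is not installed yet.
    import mlcroissant as mlc

    dataset = mlc.Dataset(jsonld=jsonld)
    records = dataset.records(record_set=record_set)
    # Records are produced lazily, so only the requested number is loaded.
    return list(itertools.islice(records, num_records))


# Example call, mirroring the command above:
# first_records("../../datasets/titanic/metadata.json", "passengers", 10)
```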
Loading a distribution via git+https
If the encodingFormat of a distribution is git+https, please provide the username and password by setting the CROISSANT_GIT_USERNAME and CROISSANT_GIT_PASSWORD environment variables. These will be used to construct the authentication necessary to load the distribution.
Note that, for datasets hosted on Hugging Face, CROISSANT_GIT_USERNAME and CROISSANT_GIT_PASSWORD should correspond to your Hugging Face username and User Access Token, respectively. User Access Tokens can be generated following this guide.
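A script that loads a git+https distribution can check up front that both variables are set, rather than failing midway through the load. The following is a minimal sketch; `missing_git_credentials` is a hypothetical helper and not part of mlcroissant.

```python
import os
from typing import Mapping

# Variable names taken from the paragraph above.
GIT_VARIABLES = ("CROISSANT_GIT_USERNAME", "CROISSANT_GIT_PASSWORD")


def missing_git_credentials(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of the git credential variables that are unset or empty."""
    return [name for name in GIT_VARIABLES if not environ.get(name)]


# Example: fail early with a clear message before loading the dataset.
# missing = missing_git_credentials()
# if missing:
#     raise RuntimeError(f"Please set: {', '.join(missing)}")
```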
Loading a distribution via HTTP with Basic Auth
If the contentUrl of a distribution requires authentication via Basic Auth, please provide the username and password by setting the CROISSANT_BASIC_AUTH_USERNAME and CROISSANT_BASIC_AUTH_PASSWORD environment variables. These will be used to construct the authentication necessary to load the distribution.
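For reference, HTTP Basic Auth sends the username and password, joined by a colon and base64-encoded, in an Authorization header. The sketch below shows that general mechanism using the two environment variables; it is an illustration of the standard scheme, not necessarily how mlcroissant builds its requests, and `basic_auth_header` is a hypothetical helper name.

```python
import base64
import os
from typing import Mapping


def basic_auth_header(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build a Basic Auth header from the CROISSANT_BASIC_AUTH_* variables.

    Hypothetical helper illustrating the generic HTTP Basic Auth scheme.
    Raises KeyError if either variable is missing.
    """
    username = environ["CROISSANT_BASIC_AUTH_USERNAME"]
    password = environ["CROISSANT_BASIC_AUTH_PASSWORD"]
    # Credentials are joined with ":" and base64-encoded (encoded, not encrypted).
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Since base64 is only an encoding, the credentials are readable by anyone who can see the request, so Basic Auth is best used over HTTPS.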
Programmatically build JSON-LD files
You can programmatically build Croissant JSON-LD files using the Python API.
import mlcroissant as mlc

metadata = mlc.nodes.Metadata(
    name="...",
)
metadata.to_json()  # this returns the JSON-LD file.
Add new properties to the standard
Nodes (Metadata, RecordSet, etc.) implement PEP 681, so you can declare RDF triples using the dataclass syntax.
Example 1: implement CreativeWork:
@mlc_dataclasses.dataclass
class CreativeWork(Node):
    JSONLD_TYPE = SDO.CreativeWork  # https://schema.org/CreativeWork
    name: str | None = mlc_dataclasses.jsonld_field(
        cardinality="ONE",  # Cardinality can be ONE or MANY
        default=None,  # Specify the default value in Python
        description="The name of the item.",  # The full description
        input_types=[SDO.Text],  # The schema.org type
        url=SDO.name,  # The URL of the property
    )
Example 2: implement RecordSet:
@mlc_dataclasses.dataclass
class RecordSet(Node):
    JSONLD_TYPE = constants.ML_COMMONS_RECORD_SET_TYPE
    fields: list[Field] = mlc_dataclasses.jsonld_field(
        cardinality="MANY",  # Example with cardinality=="MANY"
        default_factory=list,
        description=(
            "A data element that appears in the records of the RecordSet (e.g., one"
            " column of a table)."
        ),
        input_types=[Field],  # Types can also be other nodes (here `Field`)
        url=constants.ML_COMMONS_FIELD,
    )
Example 3: specify a version (by default all versions):
@mlc_dataclasses.dataclass
class Field(Node):
    is_enumeration: bool | None = mlc_dataclasses.jsonld_field(
        default=None,
        input_types=[SDO.Boolean],
        url=constants.ML_COMMONS_IS_ENUMERATION,
        versions=[CroissantVersion.V_0_8],  # `is_enumeration` is only valid for v0.8, not v1.0
    )
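To make the idea behind these declarations concrete, here is a toy analogue built only on the standard library's dataclasses module. It is a sketch of the general pattern (attach a property URL to each field, then emit one triple per populated field) and not mlcroissant's implementation; this `jsonld_field` is a simplified stand-in, and `to_triples` is a hypothetical function.

```python
import dataclasses
from typing import Optional


def jsonld_field(url: str, default=None):
    """Simplified stand-in for `mlc_dataclasses.jsonld_field`: keeps only the URL."""
    return dataclasses.field(default=default, metadata={"url": url})


@dataclasses.dataclass
class CreativeWork:
    name: Optional[str] = jsonld_field(url="https://schema.org/name")
    description: Optional[str] = jsonld_field(url="https://schema.org/description")


def to_triples(subject: str, node) -> list[tuple[str, str, str]]:
    """Emit one (subject, predicate, object) triple per field that has a value."""
    triples = []
    for field in dataclasses.fields(node):
        value = getattr(node, field.name)
        if value is not None:
            # The predicate is the property URL stored in the field metadata.
            triples.append((subject, field.metadata["url"], value))
    return triples
```

Keeping the URL next to the field declaration means the Python attribute and its RDF property are defined in one place.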
Run tests
All tests can be run from the Makefile:
make tests
Note that git lfs should be installed to successfully pass all tests:
git lfs install
Design
The most important modules in the library are:
- mlcroissant/_src/structure_graph is responsible for the static analysis of the Croissant files. We convert Croissant files to a Python representation called "structure graph" (using NetworkX). In the process, we catch any static analysis issues (e.g., a missing mandatory field or a logic problem in the file).
- mlcroissant/_src/operation_graph is responsible for the dynamic analysis of the Croissant files (i.e., actually loading the dataset by yielding examples). We convert the structure graph into an "operation graph". Operations are the unit transformations used to build the dataset (like Download, Extract, etc.).
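The two phases can be illustrated with a toy example. This is a sketch of the general idea only, under simplifying assumptions: it uses plain dictionaries and the standard library's graphlib instead of NetworkX, and every name in it (`structure`, `static_issues`, `operation_order`, the node keys) is hypothetical rather than taken from the library.

```python
from graphlib import TopologicalSorter

# Toy "structure graph": each node lists the nodes it depends on.
structure = {
    "titanic.csv": {
        "type": "FileObject",
        "content_url": "https://example.org/titanic.csv",
        "depends_on": [],
    },
    "passengers": {"type": "RecordSet", "depends_on": ["titanic.csv"]},
}


def static_issues(graph: dict) -> list[str]:
    """Static phase: report problems without downloading or reading any data."""
    issues = []
    for name, node in graph.items():
        if node["type"] == "FileObject" and not node.get("content_url"):
            issues.append(f"{name}: missing content_url")
        for dependency in node["depends_on"]:
            if dependency not in graph:
                issues.append(f"{name}: unknown dependency {dependency}")
    return issues


def operation_order(graph: dict) -> list[str]:
    """Dynamic phase: order the nodes so that dependencies are processed first."""
    sorter = TopologicalSorter({name: node["depends_on"] for name, node in graph.items()})
    return list(sorter.static_order())
```

In this toy version, a file must be fetched before the record set that reads it, which is the kind of ordering an operation graph encodes.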
Other important modules are:
- mlcroissant/_src/core defines all needed core internals. For instance, Issues are a way to track errors and warnings during the analysis of Croissant files.
- mlcroissant/__init__.py declares the public API with mlcroissant.Dataset.
For the full design and an overview of the implementation, refer to the design doc.
Caching. By default, all downloaded/extracted files are cached in ~/.cache/croissant, but you can override this by setting the environment variable $CROISSANT_CACHE.
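The lookup rule described above (environment variable first, default directory otherwise) can be sketched as follows. This is an illustration of the documented behavior, not the library's own code; `croissant_cache_dir` is a hypothetical helper name.

```python
import os
from pathlib import Path
from typing import Mapping


def croissant_cache_dir(environ: Mapping[str, str] = os.environ) -> Path:
    """Return the cache directory: $CROISSANT_CACHE if set, else ~/.cache/croissant."""
    override = environ.get("CROISSANT_CACHE")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "croissant"
```

Such a helper is handy for inspecting or clearing cached files, for example when a download was interrupted.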
Contribute
All contributions are welcome! We even have good first issues to start in the project. Refer to the GitHub project for more detailed user stories and read above how the repo is designed.
An easy way to contribute to mlcroissant is using Croissant's configured codespaces.
To start a codespace:
- On Croissant's main repo page, click on the <Code> button and select the Codespaces tab. You can start a new codespace by clicking on the + sign on the left side of the tab. By default, the codespace will start on Croissant's main branch, unless you select otherwise from the branches drop-down menu on the left side.
- While building the environment, your codespace will install all of mlcroissant's required dependencies, so that you can start coding right away! Of course, you can further personalize your codespace.
- To start contributing to Croissant:
  - Create a new branch from the Terminal tab in the bottom panel of your codespace with git checkout -b feature/my-awesome-new-feature
  - You can create new commits, and run most git commands from the Source Control tab in the left panel of your codespace. Alternatively, use the Terminal in the bottom panel of your codespace.
  - Iterate on your code until all tests are green (you can run tests with make pytest or from the Tests tab in the left panel of your codespace).
  - Open a pull request (PR) with the main branch of https://github.com/mlcommons/croissant, and ask for feedback!
Alternatively, you can contribute to mlcroissant using the "classic" GitHub workflow.
Debug
You can debug the validation of the file using the --debug flag:
mlcroissant validate --jsonld ../../datasets/titanic/metadata.json --debug
This will:
- print extra information, like the generated nodes;
- save the generated structure graph to a folder indicated in the logs.
Publishing packages
To publish a package:
- Bump the version in croissant/python/mlcroissant/pyproject.toml, and merge your PR.
- Publish a new release in GitHub, and add a tag to it with the newest version in pyproject.toml. Ensure that the new release is marked as latest. The workflow script python-publish.yml will trigger and publish the package to PyPI.