Machine Learning dataset loaders

Requires Python: >=3.6

Machine learning dataset loaders for testing and examples

Loaders for various machine learning datasets for testing and example scripts. Previously in thinc.extra.datasets.

Setup and installation

The package can be installed via pip:

pip install ml-datasets

Loaders

Loaders can be imported directly or used via their string name (which is useful if they're set via command line arguments). Some loaders may take arguments – see the source for details.

# Import directly
from ml_datasets import imdb
train_data, dev_data = imdb()
# Load via registry
from ml_datasets import loaders
imdb_loader = loaders.get("imdb")
train_data, dev_data = imdb_loader()
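
Because loaders can be looked up by string name, a script can take the dataset as a command-line argument. The sketch below shows one way this might look. The `--dataset` flag and the helper names `parse_args` and `load_dataset` are illustrative assumptions, not part of the package, and the sketch assumes the chosen loader needs no arguments.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) --dataset flag; defaults to "imdb"."""
    parser = argparse.ArgumentParser(description="Load a dataset by name")
    parser.add_argument("--dataset", default="imdb", help="Loader ID, e.g. imdb")
    return parser.parse_args(argv)


def load_dataset(name):
    """Look up a loader by its string name and call it."""
    # Imported here so that argument parsing works even before
    # ml-datasets is installed.
    from ml_datasets import loaders

    loader = loaders.get(name)
    return loader()
```

Calling `load_dataset(parse_args().dataset)` in a script would then load whichever dataset was named on the command line.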

Available loaders

NLP datasets

| ID / Function | Description | NLP task | From URL |
| --- | --- | --- | --- |
| imdb | IMDB sentiment dataset | Binary classification: sentiment analysis | |
| dbpedia | DBPedia ontology dataset | Multi-class single-label classification | |
| cmu | CMU movie genres dataset | Multi-class, multi-label classification | |
| quora_questions | Duplicate Quora questions dataset | Detecting duplicate questions | |
| reuters | Reuters dataset (texts not included) | Multi-class multi-label classification | |
| snli | Stanford Natural Language Inference corpus | Recognizing textual entailment | |
| stack_exchange | Stack Exchange dataset | Question Answering | |
| ud_ancora_pos_tags | Universal Dependencies Spanish AnCora corpus | POS tagging | |
| ud_ewtb_pos_tags | Universal Dependencies English EWT corpus | POS tagging | |
| wikiner | WikiNER data | Named entity recognition | |

Other ML datasets

| ID / Function | Description | ML task | From URL |
| --- | --- | --- | --- |
| mnist | MNIST data | Image recognition | |

Dataset details

IMDB

Each instance contains the text of a movie review, and a sentiment expressed as 0 or 1.

import ml_datasets

train_data, dev_data = ml_datasets.imdb()
for text, annot in train_data[0:5]:
    print(f"Review: {text}")
    print(f"Sentiment: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 25000 | 25000 |
| Label values | {0, 1} | {0, 1} |
| Labels per instance | Single | Single |
| Label distribution | Balanced (50/50) | Balanced (50/50) |
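
The label distribution of a single-label dataset can be checked with a few lines of standard-library code. This is a minimal sketch that assumes each instance is a `(text, label)` pair, as in the loop above. The helper name `label_distribution` and the toy data are illustrative, not part of the package.

```python
from collections import Counter


def label_distribution(data):
    """Return {label: fraction} for an iterable of (text, label) pairs."""
    counts = Counter(label for _, label in data)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy stand-in for train_data; real data would come from ml_datasets.imdb()
toy_data = [("great film", 1), ("dull plot", 0), ("loved it", 1), ("too long", 0)]
distribution = label_distribution(toy_data)
```

For a balanced binary dataset, both fractions should come out close to 0.5.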

DBPedia

Each instance contains an ontological description and a classification into one of 14 distinct labels.

import ml_datasets

train_data, dev_data = ml_datasets.dbpedia()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Category: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 560000 | 70000 |
| Label values | 1-14 | 1-14 |
| Labels per instance | Single | Single |
| Label distribution | Balanced | Balanced |
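
The label values run from 1 to 14, whereas many training libraries expect zero-based class indices. The sketch below shifts the labels down by one. It assumes the labels arrive as integers or numeric strings, which should be confirmed against the loader source. The helper name `to_zero_based` and the toy data are illustrative.

```python
def to_zero_based(data, num_classes=14):
    """Map 1-based labels to 0-based indices, validating the range."""
    converted = []
    for text, label in data:
        index = int(label) - 1  # int() accepts both 3 and "3"
        if not 0 <= index < num_classes:
            raise ValueError(f"Label {label!r} is outside 1-{num_classes}")
        converted.append((text, index))
    return converted


# Toy stand-in for train_data; real data would come from ml_datasets.dbpedia()
toy_data = [("first toy text", 1), ("second toy text", "9")]
zero_based = to_zero_based(toy_data)
```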

CMU

Each instance contains a movie description, and a classification into a list of appropriate genres.

import ml_datasets

train_data, dev_data = ml_datasets.cmu()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Genres: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 41793 | 0 |
| Label values | 363 different genres | - |
| Labels per instance | Multiple | - |
| Label distribution | Imbalanced: 147 labels with less than 20 examples, while Drama occurs more than 19000 times | - |
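
With multiple labels per instance, the imbalance is easiest to see by counting how many instances carry each genre. This is a minimal sketch that assumes each annotation is a list of genre strings. The helper names `genre_counts` and `rare_genres` and the toy data are illustrative, not part of the package.

```python
from collections import Counter


def genre_counts(data):
    """Count how many instances carry each genre label."""
    counts = Counter()
    for _, genres in data:
        counts.update(set(genres))  # count each genre once per instance
    return counts


def rare_genres(counts, min_examples=20):
    """Return the genres with fewer than min_examples instances, sorted."""
    return sorted(genre for genre, n in counts.items() if n < min_examples)


# Toy stand-in for train_data; real data would come from ml_datasets.cmu()
toy_data = [
    ("toy plot 1", ["Drama", "Comedy"]),
    ("toy plot 2", ["Drama"]),
    ("toy plot 3", ["Drama", "Film noir"]),
]
counts = genre_counts(toy_data)
rare = rare_genres(counts, min_examples=2)
```

Rare genres found this way could be merged or dropped before training, depending on the task.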

Quora

Each instance contains two Quora questions and a label indicating whether or not they are duplicates (0: no, 1: yes). The ground-truth labels contain some noise and are not guaranteed to be perfect.

import ml_datasets

train_data, dev_data = ml_datasets.quora_questions()
for questions, annot in train_data[0:50]:
    q1, q2 = questions
    print(f"Question 1: {q1}")
    print(f"Question 2: {q2}")
    print(f"Similarity: {annot}")

| Property | Training | Dev |
| --- | --- | --- |
| # Instances | 363859 | 40429 |
| Label values | {0, 1} | {0, 1} |
| Labels per instance | Single | Single |
| Label distribution | Imbalanced: 63% label 0 | Imbalanced: 63% label 0 |
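
Because 63% of the instances carry label 0, a model that always predicts 0 would already reach about 63% accuracy, so reported scores are best compared against that baseline. The sketch below computes such a majority-class baseline. It assumes each instance is a `(questions, label)` pair as in the loop above. The helper name `majority_baseline` and the toy data are illustrative.

```python
from collections import Counter


def majority_baseline(train_data, dev_data):
    """Accuracy on dev_data when always predicting the most common training label."""
    label_counts = Counter(label for _, label in train_data)
    majority_label, _ = label_counts.most_common(1)[0]
    hits = sum(1 for _, label in dev_data if label == majority_label)
    return majority_label, hits / len(dev_data)


# Toy data mirroring the 63/37 split; real data would come from
# ml_datasets.quora_questions()
toy_train = [(("q1", "q2"), 0)] * 63 + [(("q1", "q2"), 1)] * 37
toy_dev = [(("q1", "q2"), 0)] * 63 + [(("q1", "q2"), 1)] * 37
label, accuracy = majority_baseline(toy_train, toy_dev)
```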

Registering loaders

Loaders can be registered externally using the loaders registry as a decorator. For example:

import ml_datasets

@ml_datasets.loaders("my_custom_loader")
def my_custom_loader():
    # load_some_data is a placeholder for your own loading code
    return load_some_data()

assert "my_custom_loader" in ml_datasets.loaders
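
To illustrate the general pattern behind a decorator-based registry, here is a small self-contained stand-in. It is not the package's implementation, which may differ internally. `ToyRegistry`, `toy_loaders`, and the toy data are hypothetical names used only for this sketch.

```python
class ToyRegistry:
    """Illustrative stand-in for a decorator-based registry (not ml_datasets code)."""

    def __init__(self):
        self._functions = {}

    def __call__(self, name):
        # Called as @registry("name"); returns the actual decorator.
        def decorator(func):
            self._functions[name] = func
            return func  # leave the function usable under its own name

        return decorator

    def __contains__(self, name):
        return name in self._functions

    def get(self, name):
        return self._functions[name]


toy_loaders = ToyRegistry()


@toy_loaders("my_custom_loader")
def my_custom_loader():
    # A loader that mirrors the (train_data, dev_data) convention used above
    train_data = [("good", 1), ("bad", 0)]
    dev_data = [("fine", 1)]
    return train_data, dev_data


train_data, dev_data = toy_loaders.get("my_custom_loader")()
```

Returning the undecorated function from the decorator keeps `my_custom_loader` callable directly as well as through the registry.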