unitxt 1.26.6
Load any mixture of text-to-text data in one line of code
pip install unitxt
Requires Python >=3.8
Dependencies
- datasets>=2.16.0
- evaluate
- scipy>=1.10.1
- diskcache
- ruff; extra == "dev"
- pre-commit; extra == "dev"
- detect-secrets; extra == "dev"
- tomli; extra == "dev"
- codespell; extra == "dev"
- fuzzywuzzy; extra == "dev"
- httpretty; extra == "dev"
- psutil; extra == "dev"
- sphinx_rtd_theme; extra == "docs"
- piccolo_theme; extra == "docs"
- sphinxext-opengraph; extra == "docs"
- datasets<4.0,>=2.16.0; extra == "docs"
- evaluate; extra == "docs"
- nltk; extra == "docs"
- rouge_score; extra == "docs"
- scikit-learn; extra == "docs"
- jiwer; extra == "docs"
- editdistance; extra == "docs"
- fuzzywuzzy; extra == "docs"
- pydantic; extra == "docs"
- crfm-helm[unitxt]>=0.5.3; extra == "helm"
- torch==1.12.1; extra == "service"
- fastapi==0.109.0; extra == "service"
- uvicorn[standard]==0.27.0.post1; extra == "service"
- python-jose[cryptography]==3.3.0; extra == "service"
- transformers; extra == "service"
- bert_score; extra == "tests"
- transformers; extra == "tests"
- sentence_transformers; extra == "tests"
- ibm-cos-sdk; extra == "tests"
- kaggle==1.6.14; extra == "tests"
- opendatasets; extra == "tests"
- httpretty~=1.1.4; extra == "tests"
- editdistance; extra == "tests"
- rouge-score; extra == "tests"
- nltk; extra == "tests"
- sacrebleu[ja,ko]; extra == "tests"
- scikit-learn<=1.5.2; extra == "tests"
- jiwer; extra == "tests"
- conllu; extra == "tests"
- llama-index-core; extra == "tests"
- llama-index-llms-openai; extra == "tests"
- pytrec-eval; extra == "tests"
- SentencePiece; extra == "tests"
- fuzzywuzzy; extra == "tests"
- openai; extra == "tests"
- ibm-generative-ai; extra == "tests"
- bs4; extra == "tests"
- tenacity==8.3.0; extra == "tests"
- accelerate; extra == "tests"
- func_timeout==4.3.5; extra == "tests"
- Wikipedia-API; extra == "tests"
- sqlglot; extra == "tests"
- sqlparse; extra == "tests"
- diskcache; extra == "tests"
- pydantic; extra == "tests"
- jsonschema_rs; extra == "tests"
- gradio; extra == "ui"
- transformers; extra == "ui"
- sqlglot; extra == "text2sql"
- func_timeout==4.3.5; extra == "text2sql"
- sqlparse; extra == "text2sql"
- tabulate; extra == "text2sql"
- ibm-watsonx-ai==1.2.10; extra == "watsonx"
- litellm>=1.52.9; extra == "inference-tests"
- tenacity; extra == "inference-tests"
- diskcache; extra == "inference-tests"
- numpy==1.26.4; extra == "inference-tests"
- ollama; extra == "inference-tests"
- streamlit; extra == "assistant"
- watchdog; extra == "assistant"
- litellm; extra == "assistant"
- litellm>=1.52.9; extra == "remote-inference"
- tenacity; extra == "remote-inference"
- diskcache; extra == "remote-inference"
- transformers; extra == "local-inference"
- torch; extra == "local-inference"
- accelerate; extra == "local-inference"
- unitxt[remote_inference]; extra == "bluebench"
- unitxt[local_inference]; extra == "bluebench"
- conllu; extra == "bluebench"
- scikit-learn; extra == "bluebench"
- sympy; extra == "bluebench"
- bert_score; extra == "bluebench"
- nltk; extra == "bluebench"
- rouge_score; extra == "bluebench"
- sacrebleu[ko]; extra == "bluebench"
- unitxt[base]; extra == "all"
- unitxt[dev]; extra == "all"
- unitxt[docs]; extra == "all"
- unitxt[helm]; extra == "all"
- unitxt[service]; extra == "all"
- unitxt[tests]; extra == "all"
- unitxt[ui]; extra == "all"
- unitxt[watsonx]; extra == "all"
- unitxt[assistant]; extra == "all"
- unitxt[text2sql]; extra == "all"

🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking.
Why Unitxt?
- 🌐 Comprehensive: Evaluate text, tables, vision, speech, and code in one unified framework
- 💼 Enterprise-Ready: Battle-tested components with extensive catalog of benchmarks
- 🧠 Model Agnostic: Works with HuggingFace, OpenAI, WatsonX, and custom models
- 🔒 Reproducible: Shareable, modular components ensure consistent results
Installation
pip install unitxt
Quick Start
Command Line Evaluation
# Simple evaluation
unitxt-evaluate \
    --tasks "card=cards.mmlu_pro.engineering" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct" \
    --limit 10

# Multi-task evaluation
unitxt-evaluate \
    --tasks "card=cards.text2sql.bird+card=cards.mmlu_pro.engineering" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
    --split test \
    --limit 10 \
    --output_path ./results/evaluate_cli \
    --log_samples \
    --apply_chat_template

# Benchmark evaluation
unitxt-evaluate \
    --tasks "benchmarks.tool_calling" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
    --split test \
    --limit 10 \
    --output_path ./results/evaluate_cli \
    --log_samples \
    --apply_chat_template
Loading as Dataset
Load thousands of datasets in chat API format, ready for any model:
from unitxt import load_dataset

dataset = load_dataset(
    card="cards.gpqa.diamond",
    split="test",
    format="formats.chat_api",
)
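Each returned instance already carries the rendered model input. As a follow-up to the snippet above, here is a minimal sketch of inspecting the first test example; the field names "source" (chat-formatted input) and "references" (gold answers) are assumed from unitxt's standard output schema and worth verifying against your unitxt version:

# Peek at the first instance of the dataset loaded above.
# "source" and "references" are assumed unitxt output fields.
first = dataset[0]
print(first["source"])      # chat messages, ready to send to a model
print(first["references"])  # gold answers consumed by the metrics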
📊 Available on The Catalog
🚀 Interactive Dashboard
Launch the graphical user interface to explore datasets and benchmarks:
pip install unitxt[ui]
unitxt-explore
Complete Python Example
Evaluate your own data with any model:
# Import required components
from unitxt import evaluate, create_dataset
from unitxt.blocks import Task, InputOutputTemplate
from unitxt.inference import HFAutoModelInferenceEngine

# Question-answer dataset
data = [
    {"question": "What is the capital of Texas?", "answer": "Austin"},
    {"question": "What is the color of the sky?", "answer": "Blue"},
]

# Define the task and evaluation metric
task = Task(
    input_fields={"question": str},
    reference_fields={"answer": str},
    prediction_type=str,
    metrics=["metrics.accuracy"],
)

# Create a template to format inputs and outputs
template = InputOutputTemplate(
    instruction="Answer the following question.",
    input_format="{question}",
    output_format="{answer}",
    postprocessors=["processors.lower_case"],
)

# Prepare the dataset
dataset = create_dataset(
    task=task,
    template=template,
    format="formats.chat_api",
    test_set=data,
    split="test",
)

# Set up the model (supports Hugging Face, WatsonX, OpenAI, etc.)
model = HFAutoModelInferenceEngine(
    model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
)

# Generate predictions and evaluate
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)

# Print results
print("Global Results:\n", results.global_scores.summary)
print("Instance Results:\n", results.instance_scores.summary)
Contributing
Read the contributing guide for details on how to contribute to Unitxt.
Citation
If you use Unitxt in your research, please cite our paper:
@inproceedings{bandel-etal-2024-unitxt,
    title = "Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative {AI}",
    author = "Bandel, Elron and
      Perlitz, Yotam and
      Venezian, Elad and
      Friedman, Roni and
      Arviv, Ofir and
      Orbach, Matan and
      Don-Yehiya, Shachar and
      Sheinwald, Dafna and
      Gera, Ariel and
      Choshen, Leshem and
      Shmueli-Scheuer, Michal and
      Katz, Yoav",
    editor = "Chang, Kai-Wei and
      Lee, Annie and
      Rajani, Nazneen",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-demo.21",
    pages = "207--215",
}