unitxt 1.26.6
Load any mixture of text-to-text data in one line of code
pip install unitxt
Requires Python
>=3.8
Dependencies
- datasets >=2.16.0
- evaluate
- scipy >=1.10.1
- diskcache
- ruff; extra == "dev"
- pre-commit; extra == "dev"
- detect-secrets; extra == "dev"
- tomli; extra == "dev"
- codespell; extra == "dev"
- fuzzywuzzy; extra == "dev"
- httpretty; extra == "dev"
- psutil; extra == "dev"
- sphinx_rtd_theme; extra == "docs"
- piccolo_theme; extra == "docs"
- sphinxext-opengraph; extra == "docs"
- datasets <4.0,>=2.16.0; extra == "docs"
- evaluate; extra == "docs"
- nltk; extra == "docs"
- rouge_score; extra == "docs"
- scikit-learn; extra == "docs"
- jiwer; extra == "docs"
- editdistance; extra == "docs"
- fuzzywuzzy; extra == "docs"
- pydantic; extra == "docs"
- crfm-helm[unitxt] >=0.5.3; extra == "helm"
- torch ==1.12.1; extra == "service"
- fastapi ==0.109.0; extra == "service"
- uvicorn[standard] ==0.27.0.post1; extra == "service"
- python-jose[cryptography] ==3.3.0; extra == "service"
- transformers; extra == "service"
- bert_score; extra == "tests"
- transformers; extra == "tests"
- sentence_transformers; extra == "tests"
- ibm-cos-sdk; extra == "tests"
- kaggle ==1.6.14; extra == "tests"
- opendatasets; extra == "tests"
- httpretty ~=1.1.4; extra == "tests"
- editdistance; extra == "tests"
- rouge-score; extra == "tests"
- nltk; extra == "tests"
- sacrebleu[ja,ko]; extra == "tests"
- scikit-learn <=1.5.2; extra == "tests"
- jiwer; extra == "tests"
- conllu; extra == "tests"
- llama-index-core; extra == "tests"
- llama-index-llms-openai; extra == "tests"
- pytrec-eval; extra == "tests"
- SentencePiece; extra == "tests"
- fuzzywuzzy; extra == "tests"
- openai; extra == "tests"
- ibm-generative-ai; extra == "tests"
- bs4; extra == "tests"
- tenacity ==8.3.0; extra == "tests"
- accelerate; extra == "tests"
- func_timeout ==4.3.5; extra == "tests"
- Wikipedia-API; extra == "tests"
- sqlglot; extra == "tests"
- sqlparse; extra == "tests"
- diskcache; extra == "tests"
- pydantic; extra == "tests"
- jsonschema_rs; extra == "tests"
- gradio; extra == "ui"
- transformers; extra == "ui"
- sqlglot; extra == "text2sql"
- func_timeout ==4.3.5; extra == "text2sql"
- sqlparse; extra == "text2sql"
- tabulate; extra == "text2sql"
- ibm-watsonx-ai ==1.2.10; extra == "watsonx"
- litellm >=1.52.9; extra == "inference-tests"
- tenacity; extra == "inference-tests"
- diskcache; extra == "inference-tests"
- numpy ==1.26.4; extra == "inference-tests"
- ollama; extra == "inference-tests"
- streamlit; extra == "assistant"
- watchdog; extra == "assistant"
- litellm; extra == "assistant"
- litellm >=1.52.9; extra == "remote-inference"
- tenacity; extra == "remote-inference"
- diskcache; extra == "remote-inference"
- transformers; extra == "local-inference"
- torch; extra == "local-inference"
- accelerate; extra == "local-inference"
- unitxt[remote_inference]; extra == "bluebench"
- unitxt[local_inference]; extra == "bluebench"
- conllu; extra == "bluebench"
- scikit-learn; extra == "bluebench"
- sympy; extra == "bluebench"
- bert_score; extra == "bluebench"
- nltk; extra == "bluebench"
- rouge_score; extra == "bluebench"
- sacrebleu[ko]; extra == "bluebench"
- unitxt[base]; extra == "all"
- unitxt[dev]; extra == "all"
- unitxt[docs]; extra == "all"
- unitxt[helm]; extra == "all"
- unitxt[service]; extra == "all"
- unitxt[tests]; extra == "all"
- unitxt[ui]; extra == "all"
- unitxt[watsonx]; extra == "all"
- unitxt[assistant]; extra == "all"
- unitxt[text2sql]; extra == "all"
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking.
Why Unitxt?
- 🌐 Comprehensive: Evaluate text, tables, vision, speech, and code in one unified framework
- 💼 Enterprise-Ready: Battle-tested components with extensive catalog of benchmarks
- 🧠 Model Agnostic: Works with HuggingFace, OpenAI, WatsonX, and custom models
- 🔒 Reproducible: Shareable, modular components ensure consistent results
Installation
pip install unitxt
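Optional features ship as extras (see the dependency list above). A couple of hedged examples; the extra names come from the package metadata, and the quotes keep some shells from expanding the brackets:
pip install "unitxt[ui]"               # Gradio-based dataset explorer
pip install "unitxt[local-inference]"  # transformers, torch, accelerate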
Quick Start
Command Line Evaluation
# Simple evaluation
unitxt-evaluate \
--tasks "card=cards.mmlu_pro.engineering" \
--model cross_provider \
--model_args "model_name=llama-3-1-8b-instruct" \
--limit 10
# Multi-task evaluation
unitxt-evaluate \
--tasks "card=cards.text2sql.bird+card=cards.mmlu_pro.engineering" \
--model cross_provider \
--model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
--split test \
--limit 10 \
--output_path ./results/evaluate_cli \
--log_samples \
--apply_chat_template
# Benchmark evaluation
unitxt-evaluate \
--tasks "benchmarks.tool_calling" \
--model cross_provider \
--model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
--split test \
--limit 10 \
--output_path ./results/evaluate_cli \
--log_samples \
--apply_chat_template
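With --output_path and --log_samples set as above, the aggregated scores and the per-sample records should land under ./results/evaluate_cli, so separate runs can be compared after the fact.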
Loading as Dataset
Load thousands of datasets in chat API format, ready for any model:
from unitxt import load_dataset
dataset = load_dataset(
card="cards.gpqa.diamond",
split="test",
format="formats.chat_api",
)
📊 Available on The Catalog
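As a quick sanity check, you can peek at the first instance of the loaded split. A minimal sketch, assuming the returned object behaves like a Hugging Face Dataset whose instances carry unitxt's conventional "source" (model input) and "target" (reference) fields:
# Inspect the first test instance; the field names are unitxt conventions,
# shown here as an assumption for illustration.
print(dataset["test"][0]["source"] if isinstance(dataset, dict) else dataset[0]["source"])  # chat-formatted input
print(dataset[0]["target"])  # reference answer used for scoring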
🚀 Interactive Dashboard
Launch the graphical user interface to explore datasets and benchmarks:
pip install unitxt[ui]
unitxt-explore
Complete Python Example
Evaluate your own data with any model:
# Import required components
from unitxt import evaluate, create_dataset
from unitxt.blocks import Task, InputOutputTemplate
from unitxt.inference import HFAutoModelInferenceEngine
# Question-answer dataset
data = [
{"question": "What is the capital of Texas?", "answer": "Austin"},
{"question": "What is the color of the sky?", "answer": "Blue"},
]
# Define the task and evaluation metric
task = Task(
input_fields={"question": str},
reference_fields={"answer": str},
prediction_type=str,
metrics=["metrics.accuracy"],
)
# Create a template to format inputs and outputs
template = InputOutputTemplate(
instruction="Answer the following question.",
input_format="{question}",
output_format="{answer}",
postprocessors=["processors.lower_case"],
)
# Prepare the dataset
dataset = create_dataset(
task=task,
template=template,
format="formats.chat_api",
test_set=data,
split="test",
)
# Set up the model (supports Hugging Face, WatsonX, OpenAI, etc.)
model = HFAutoModelInferenceEngine(
model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
)
# Generate predictions and evaluate
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)
# Print results
print("Global Results:\n", results.global_scores.summary)
print("Instance Results:\n", results.instance_scores.summary)
Contributing
Read the contributing guide for details on how to contribute to Unitxt.
Citation
If you use Unitxt in your research, please cite our paper:
@inproceedings{bandel-etal-2024-unitxt,
title = "Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative {AI}",
author = "Bandel, Elron and
Perlitz, Yotam and
Venezian, Elad and
Friedman, Roni and
Arviv, Ofir and
Orbach, Matan and
Don-Yehiya, Shachar and
Sheinwald, Dafna and
Gera, Ariel and
Choshen, Leshem and
Shmueli-Scheuer, Michal and
Katz, Yoav",
editor = "Chang, Kai-Wei and
Lee, Annie and
Rajani, Nazneen",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-demo.21",
pages = "207--215",
}