arize-phoenix-evals 0.21.0
LLM Evaluations
pip install arize-phoenix-evals
Requires Python: >=3.8, <3.14
Dependencies
- pandas
- tqdm
- typing-extensions <5,>=4.5
- anthropic >0.18.0; extra == "dev"
- boto3; extra == "dev"
- litellm >=1.28.9; extra == "dev"
- mistralai >=1.0.0; extra == "dev"
- openai >=1.0.0; extra == "dev"
- vertexai; extra == "dev"
- anthropic >=0.18.0; extra == "test"
- boto3; extra == "test"
- lameenc; extra == "test"
- litellm >=1.28.9; extra == "test"
- mistralai >=1.0.0; extra == "test"
- nest-asyncio; extra == "test"
- openai >=1.0.0; extra == "test"
- openinference-semantic-conventions; extra == "test"
- pandas; extra == "test"
- pandas-stubs <=2.0.2.230605; extra == "test"
- respx; extra == "test"
- tqdm; extra == "test"
- types-tqdm; extra == "test"
- typing-extensions <5,>=4.5; extra == "test"
- vertexai; extra == "test"
arize-phoenix-evals
Phoenix provides tooling to evaluate LLM applications, including tools to determine the relevance or irrelevance of documents retrieved by a retrieval-augmented generation (RAG) application, whether or not a response is toxic, and much more.
Phoenix's approach to LLM evals is notable for the following reasons:
- Includes pre-tested templates and convenience functions for a set of common Eval "tasks" (see the sketch after this list)
- Data science rigor applied to the testing of model and template combinations
- Designed to run as fast as possible on batches of data
- Includes benchmark datasets and tests for each eval function
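For instance, a pre-tested toxicity template can be dropped straight into llm_classify. The following is a minimal sketch, assuming that TOXICITY_PROMPT_TEMPLATE and TOXICITY_PROMPT_RAILS_MAP follow the same naming pattern as the RAG relevance exports used in the Usage section below, and that the template reads the text to judge from an "input" column:
import os

import pandas as pd

from phoenix.evals import (
    TOXICITY_PROMPT_RAILS_MAP,
    TOXICITY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Two toy rows to classify; in practice this would be text from your application.
df = pd.DataFrame(
    {
        "input": [
            "Thanks so much for the thoughtful review!",
            "Only an idiot would ship code like this.",
        ]
    }
)

# The rails constrain the judge model's output to the template's allowed labels.
rails = list(TOXICITY_PROMPT_RAILS_MAP.values())
result_df = llm_classify(df, OpenAIModel(model="gpt-4o-mini"), TOXICITY_PROMPT_TEMPLATE, rails)
print(result_df["label"])
The same template-plus-rails pattern applies to the other built-in eval tasks.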
Installation
Install the arize-phoenix-evals sub-package via pip:
pip install arize-phoenix-evals
Note that you will also need to install the SDK of the LLM vendor you want to use with LLM Evals. For example, to use an OpenAI model you will need to install the OpenAI Python SDK:
pip install 'openai>=1.0.0'
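Alternatively, the dev extra listed in the dependency metadata above pulls in all of the supported vendor SDKs (anthropic, boto3, litellm, mistralai, openai, vertexai) at once:
pip install 'arize-phoenix-evals[dev]'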
Usage
Here is an example of running the RAG relevance eval on a dataset of Wikipedia questions and answers. The example uses scikit-learn to compute classification metrics, so install it via pip as well:
pip install scikit-learn
import os

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Choose a model to evaluate on question-answering relevancy classification
model = OpenAIModel(
    model="o3-mini",
    temperature=0.0,
)

# Choose 100 examples from a small dataset of question-answer pairs
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
df = df.sample(100)
df = df.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

# Use the language model to classify each example in the dataset
rails_map = RAG_RELEVANCY_PROMPT_RAILS_MAP
class_names = list(rails_map.values())
result_df = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, class_names)

# Map the true labels to the class names for comparison
y_true = df["relevant"].map(rails_map)

# Get the labels generated by the model being evaluated
y_pred = result_df["label"]

# Evaluate the classification results of the model
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=class_names
)

print("Classification Results:")
for idx, label in enumerate(class_names):
    print(f"Class: {label} (count: {support[idx]})")
    print(f"  Precision: {precision[idx]:.2f}")
    print(f"  Recall: {recall[idx]:.2f}")
    print(f"  F1 Score: {f1[idx]:.2f}\n")
To learn more about LLM Evals, see the LLM Evals documentation.