arize-phoenix-evals

Phoenix provides tooling to evaluate LLM applications, including tools to determine the relevance or irrelevance of documents retrieved by a retrieval-augmented generation (RAG) application, whether or not a response is toxic, and much more.

Phoenix's approach to LLM evals is notable for the following reasons:

  • Includes pre-tested templates and convenience functions for a set of common Eval "tasks" (see the snippet after this list)
  • Data science rigor applied to the testing of model and template combinations
  • Designed to run as fast as possible on batches of data
  • Includes benchmark datasets and tests for each eval function
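
For instance, the pre-tested RAG relevance template and its output rails can be inspected directly. A minimal sketch, using the same exports as the Usage example below:

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
)

# The prompt text that will be sent to the judge model
print(RAG_RELEVANCY_PROMPT_TEMPLATE)

# Maps ground-truth values to the label strings the model may emit
print(RAG_RELEVANCY_PROMPT_RAILS_MAP)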

Installation

Install the arize-phoenix-evals sub-package via pip:

pip install arize-phoenix-evals

Note that you will also have to install the SDK for the LLM vendor you would like to use with LLM Evals. For example, to use an OpenAI model, you will need to install the OpenAI Python SDK:

pip install 'openai>=1.0.0'
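
Phoenix ships model wrappers for other vendors as well. A minimal sketch, assuming the AnthropicModel and LiteLLMModel wrappers exported by phoenix.evals (parameter names may differ by version; check the Phoenix docs for the exact signatures):

from phoenix.evals import AnthropicModel, LiteLLMModel

# Requires `pip install anthropic` and an ANTHROPIC_API_KEY in the environment
claude = AnthropicModel(model="claude-3-5-sonnet-latest")

# LiteLLM routes a single interface to many providers; requires `pip install litellm`
gemini = LiteLLMModel(model="gemini/gemini-1.5-flash")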

Usage

Here is an example of running the RAG relevance eval on a dataset of Wikipedia questions and answers:

This example also uses scikit-learn to score the results, so install it via pip:

pip install scikit-learn

import os
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support

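# OpenAIModel reads the API key from the environment via the OpenAI SDK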
os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Choose a model to run the relevancy classification.
# o3-mini is a reasoning model and does not accept a temperature setting.
model = OpenAIModel(
    model="o3-mini",
)

# Download a benchmark dataset of question-answer pairs and sample 100 examples
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
df = df.sample(100)
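# Rename the columns to match the {input} and {reference} template variables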
df = df.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

# Use the language model to classify each example in the dataset
rails_map = RAG_RELEVANCY_PROMPT_RAILS_MAP
class_names = list(rails_map.values())
result_df = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, class_names)
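# llm_classify returns a dataframe with a "label" column constrained to the rails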

# Map the true labels to the class names for comparison
y_true = df["relevant"].map(rails_map)
# Get the labels generated by the model being evaluated
y_pred = result_df["label"]

# Evaluate the classification results of the model
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=class_names)
print("Classification Results:")
for idx, label in enumerate(class_names):
    print(f"Class: {label} (count: {support[idx]})")
    print(f"  Precision: {precision[idx]:.2f}")
    print(f"  Recall:    {recall[idx]:.2f}")
    print(f"  F1 Score:  {f1[idx]:.2f}\n")

To learn more about LLM Evals, see the LLM Evals documentation.