evalplus 0.3.1
"EvalPlus for rigourous evaluation of LLM-synthesized code"
pip install evalplus
Requires Python: >=3.9
Dependencies
- wget >=3.2
- tempdir >=0.7.1
- multipledispatch >=0.6.0
- appdirs >=1.4.4
- numpy >=1.19.5
- tqdm >=4.56.0
- termcolor >=2.0.0
- fire >=0.6.0
- openai >=1.11.1
- tree-sitter >=0.22.0
- tree-sitter-python >=0.21.0
- rich >=12.3.0
- transformers >=4.43.0
- stop-sequencer >=1.2.3
- anthropic >=0.34.1
- google-generativeai >=0.7.2
- datasets >=2.21.0
- psutil >=5.9.0
- Pympler >=1.0.1; extra == "perf"
- cirron >=0.4; extra == "perf"
- vllm >=0.5.1; extra == "vllm"
EvalPlus(📖) => 📚
📰News • 🔥Quick Start • 🚀LLM Backends • 📚Documents • 📜Citation • 🙏Acknowledgement
About
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ HumanEval+: 80x more tests than the original HumanEval!
- ✨ MBPP+: 35x more tests than the original MBPP!
- ✨ EvalPerf: evaluating the efficiency of LLM-generated code!
- ✨ Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.
Why EvalPlus?
- ✨ Precise evaluation & ranking: See our leaderboard for the latest LLM rankings before & after rigorous evaluation.
- ✨ Coding rigorousness: Compare the scores before and after applying the EvalPlus tests! A smaller drop indicates more rigorous and less fragile code generation, while a large drop means the generated code tends to break under thorough testing.
- ✨ Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
Want to know more details? Read our papers & materials!
- EvalPlus: NeurIPS'23 paper, Google Slides, Poster
- EvalPerf: COLM'24 paper, Poster, Documentation
📰 News
Notable updates of EvalPlus are tracked below:
- [2024-10-20 v0.3.1]: EvalPlus v0.3.1 is officially released! Release highlights include (i) code efficiency evaluation via EvalPerf, (ii) one command to run the whole pipeline (generation + post-processing + evaluation), and (iii) support for more inference backends such as Google Gemini & Anthropic.
- [2024-06-09 pre v0.3.0]: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
- [2024-04-17 pre v0.3.0]: MBPP+ is upgraded to v0.2.0 by removing some broken tasks (399 -> 378 tasks). A ~4pp pass@1 improvement can be expected.
- Earlier:
  - (v0.2.1) You can use EvalPlus datasets via bigcode-evaluation-harness! HumanEval+ oracle fixes (32).
  - (v0.2.0) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
  - (v0.1.7) Leaderboard release; HumanEval+ contract and input fixes (32/166/126/6).
  - (v0.1.6) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140).
  - (v0.1.5) HumanEval+ Mini is released for ultra-fast evaluation when you have too many samples!
  - (v0.1.1) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
  - (v0.1.0) HumanEval+ is released!
🔥 Quick Start
- Code correctness evaluation: HumanEval(+) or MBPP(+)
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--greedy
Code execution within Docker:
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset humaneval \
--backend vllm \
--greedy
# Code execution within Docker
docker run --rm -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evaluate --dataset humaneval \
--samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
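Prefer to plug in your own generator instead of a built-in backend? Samples can also be produced programmatically and then scored with evalplus.evaluate --samples. Below is a minimal sketch, assuming the evalplus.data helpers (get_human_eval_plus, write_jsonl) and the HumanEval "prompt" field; generate_one is a hypothetical placeholder for your own model call, and each "solution" is assumed to be a self-contained program.
# Minimal sketch: build a samples JSONL for `evalplus.evaluate --samples`.
# `generate_one` is a hypothetical placeholder -- replace it with your own model call.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one(prompt: str) -> str:
    # Must return code that completes `prompt`; trivial stub here.
    return "    pass\n"

samples = [
    dict(task_id=task_id, solution=problem["prompt"] + generate_one(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
The resulting file can then be evaluated with evalplus.evaluate --dataset humaneval --samples samples.jsonl.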
- Code efficiency evaluation: EvalPerf (*nix only)
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--backend vllm
Code execution within Docker:
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset evalperf \
--backend vllm \
--temperature 1.0 \
--n-samples 100
# Code execution within Docker
docker run --cap-add PERFMON --rm -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evalperf --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
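Before launching EvalPerf, it can help to confirm that the kernel setting above actually took effect. The snippet below is a small hypothetical helper (not part of evalplus) that reads /proc/sys/kernel/perf_event_paranoid and reminds you of the command from above if needed.
# Hypothetical pre-flight check for EvalPerf (Linux only): performance
# counters are typically unavailable unless perf_event_paranoid is lowered.
from pathlib import Path

paranoid = Path("/proc/sys/kernel/perf_event_paranoid")
if paranoid.exists():
    level = int(paranoid.read_text().strip())
    print(f"perf_event_paranoid = {level}")
    if level > 0:
        print("Run: sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'")
else:
    print("perf_event_paranoid not found -- EvalPerf is *nix only.")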
🚀 LLM Backends
HuggingFace models
transformers backend:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--greedy
[!Note]
EvalPlus uses different prompts for base and chat models. By default, this is detected via tokenizer.chat_template when using the hf or vllm backend. For other backends, only chat mode is allowed.
Therefore, if your base model comes with a tokenizer.chat_template, please add --force-base-prompt to avoid being evaluated in chat mode.
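Unsure which mode your model will get? A quick way to check is to inspect the tokenizer yourself; the sketch below uses the standard transformers API (an assumption about your environment, not an evalplus command) to see whether a chat template is present.
# Check whether a checkpoint ships a chat template (standard transformers API).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ise-uiuc/Magicoder-S-DS-6.7B")
if getattr(tok, "chat_template", None):
    print("Chat template found: chat prompts will be used by default;")
    print("add --force-base-prompt if this is actually a base model.")
else:
    print("No chat template: base prompts will be used by default.")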
Enable Flash Attention 2:
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases
# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--attn-implementation [flash_attention_2|sdpa] \
--greedy
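If you are unsure whether Flash Attention 2 installed correctly, a quick import check (an optional sanity check, not an evalplus feature) can save a failed run.
# Optional sanity check before passing --attn-implementation flash_attention_2.
try:
    import flash_attn
    print("flash-attn", flash_attn.__version__, "is available")
except ImportError:
    print("flash-attn not installed; use --attn-implementation sdpa instead")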
vllm backend:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--tp [TENSOR_PARALLEL_SIZE] \
--greedy
openai compatible servers (e.g., vLLM):
# Launch a model server first: e.g., https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend openai \
--base-url http://localhost:8000/v1 \
--greedy
OpenAI models
- Access OpenAI APIs from OpenAI Console
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
--dataset [humaneval|mbpp] \
--backend openai \
--greedy
Anthropic models
- Access Anthropic APIs from Anthropic Console
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
--dataset [humaneval|mbpp] \
--backend anthropic \
--greedy
Google Gemini models
- Access Gemini APIs from Google AI Studio
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
--dataset [humaneval|mbpp] \
--backend google \
--greedy
You can check out the generations and results at evalplus_results/[humaneval|mbpp]/.
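For a quick look at the raw generations, the JSONL files in that directory can be read directly. A minimal sketch follows, assuming each line carries at least a task_id field; the file name below is just the example from the commands above, and the remaining fields depend on your model, backend, and version.
# Sketch: peek at generated samples; the path is an example and the field
# layout may vary across evalplus versions.
import json
from pathlib import Path

path = Path("evalplus_results/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl")
with path.open() as f:
    for line in f:
        record = json.loads(line)
        print(record["task_id"], "->", len(line.strip()), "characters")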
⏬ Using EvalPlus as a local repo?
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
📚 Documents
To learn more about how to use EvalPlus, please refer to the project documentation.
📜 Citation
@inproceedings{evalplus,
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
year = {2023},
url = {https://openreview.net/forum?id=1qvx610Cu7},
}
@inproceedings{evalperf,
title = {Evaluating Language Models for Efficient Code Generation},
author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
booktitle = {First Conference on Language Modeling},
year = {2024},
url = {https://openreview.net/forum?id=IBCBMeAhmC},
}