Model Evaluation Robustness

LLM performance is sensitive to evaluation details. Work by a former co-worker of mine shows that on popular multiple-choice benchmarks (e.g. MMLU), minor perturbations such as changing the order of answer choices or the method of answer selection can shift model rankings on the leaderboard by up to 8 positions.
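A minimal sketch of one such perturbation: permuting the answer choices of a benchmark item. The correct answer's position (and thus its letter label) changes, so a model with positional or label bias may flip its prediction even though the question is unchanged. The function name and structure here are illustrative, not from the paper.

```python
import random

def shuffle_choices(question, choices, answer_idx, seed=0):
    """Return the same question with its answer choices permuted.

    Only the presentation changes: the gold answer keeps its text but
    moves to a new position, so its letter label (A/B/C/D) changes too.
    """
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)
    return question, shuffled, new_answer_idx

q, opts, gold = shuffle_choices(
    "What is 2 + 2?",
    ["3", "4", "5", "6"],
    answer_idx=1,  # gold answer is "4"
)
labels = "ABCD"
print(f"Gold label after shuffling: {labels[gold]}")
```

Running the same model over the original and shuffled versions and comparing accuracy gives a quick robustness check.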

PAL for Math Reasoning

In the PAL paper, the authors found that solving mathematical problems with an external tool (a Python interpreter) can greatly boost math reasoning performance: the model writes a program that expresses the reasoning steps, and the interpreter performs the actual computation.

References

  1. When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
  2. Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy