Model Evaluation Robustness

LLM performance is sensitive to evaluation details. Work by a former co-worker of mine shows that on popular multiple-choice benchmarks (e.g. MMLU), minor perturbations such as changing the order of answer choices or the method of answer selection can shift model rankings on the leaderboard by up to 8 positions.
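A minimal sketch of one such perturbation: permuting the answer choices of a benchmark item. The correct answer's position (and thus its letter label) changes, so a model with positional or label bias may flip its prediction even though the question is unchanged. The function name and structure here are illustrative, not from the paper.

```python
import random

def shuffle_choices(question, choices, answer_idx, seed=0):
    """Return the same question with its answer choices permuted.

    Only the presentation changes: the gold answer keeps its text but
    moves to a new position, so its letter label (A/B/C/D) changes too.
    """
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)
    return question, shuffled, new_answer_idx

q, opts, gold = shuffle_choices(
    "What is 2 + 2?",
    ["3", "4", "5", "6"],
    answer_idx=1,  # gold answer is "4"
)
labels = "ABCD"
print(f"Gold label after shuffling: {labels[gold]}")
```

Running the same model over the original and shuffled versions and comparing accuracy gives a quick robustness check.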

PAL for Math Reasoning

In the PAL paper, the authors found that solving mathematical problems with an external tool (a Python interpreter) can greatly boost math reasoning performance: the model writes a program that expresses the reasoning steps, and the interpreter performs the actual computation.

References

  1. When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
  2. Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy