Model Evaluation Robustness
LLM performance is sensitive to evaluation details. Work by one of my former co-workers shows that, for popular multiple-choice benchmarks (e.g. MMLU), minor perturbations such as changing the order of the choices or the method of answer selection can shift model rankings by up to 8 positions.
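To make the choice-order perturbation concrete, here is a minimal sketch of how one might shuffle the options of a multiple-choice item while keeping track of the correct answer. The function names (`shuffle_choices`, `format_prompt`) and the prompt layout are my own assumptions for illustration, not the setup used in that work.

```python
import random

LABELS = "ABCDEFGH"  # assumed answer labels; enough for up to 8 options


def shuffle_choices(choices, answer_idx, seed):
    """Return the choices in a seeded random order plus the new index of the correct one."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    # The correct option now sits wherever its old index landed in `order`.
    new_answer_idx = order.index(answer_idx)
    return new_choices, new_answer_idx


def format_prompt(question, choices):
    """Render one item in a simple 'A. ...' layout (a hypothetical prompt format)."""
    lines = [question]
    lines += [f"{LABELS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


question = "Which planet is closest to the Sun?"
choices = ["Venus", "Mercury", "Earth", "Mars"]
answer_idx = 1  # "Mercury"

new_choices, new_answer_idx = shuffle_choices(choices, answer_idx, seed=0)
print(format_prompt(question, new_choices))
print("Correct label:", LABELS[new_answer_idx])
```

Scoring the same model on several seeds and comparing the resulting rankings is one way to check how stable a leaderboard is under this perturbation.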
PAL for Math Reasoning
In the PAL paper, the authors found that solving mathematical problems with an external tool (a Python interpreter) could greatly boost math reasoning performance.
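The idea can be sketched as follows: the model writes a small program instead of computing the answer itself, and the interpreter does the arithmetic. This is only an illustration under my own assumptions. The `GENERATED_PROGRAM` string stands in for model output (no model is called here), and the `solution()` convention and the `run_generated_program` helper are hypothetical names, not the paper's code.

```python
# Stand-in for what a model might generate for the question:
# "There were 23 apples. 20 were used for lunch and 6 more were bought.
#  How many apples are there now?"
GENERATED_PROGRAM = '''
def solution():
    apples_start = 23
    apples_used = 20
    apples_bought = 6
    return apples_start - apples_used + apples_bought
'''


def run_generated_program(program: str):
    """Execute the generated code and return the value of its solution() function.

    Note: exec() on untrusted model output is unsafe; a real setup would
    need to run this in a sandbox.
    """
    namespace = {}
    exec(program, namespace)
    return namespace["solution"]()


print(run_generated_program(GENERATED_PROGRAM))  # prints 9
```

The model is responsible only for translating the problem into steps, so arithmetic slips in the model's own text no longer affect the final answer.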
...