博士生讨论班2025[15]
时间:2025-05-26
阅读量:124次
- 演讲人: 李祎哲
- 时间:2025年5月27日14:00
- 地点:浙江大学紫金港校区行政楼1417报告厅
报告文章:Benchmarking Formal Mathematical Reasoning of Large Language Models
摘要:Formalized mathematical reasoning constitutes one of the fundamental challenges in the domain of artificial intelligence. We introduce FormalMATH: a comprehensive, large-scale benchmark specifically designed for formalized mathematical reasoning. This benchmark encompasses 5,560 mathematically rigorous propositions validated through the Lean4 compiler, spanning 12 distinct subfields such as algebra, number theory, calculus, and discrete mathematics. To overcome the limitations of conventional formalized data that predominantly depends on expert manual annotation, we have developed an innovative "three-stage filtering" framework. Evaluation results on the complete FormalMATH dataset reveal that the performance of mainstream Large Language Model (LLM) provers significantly underperforms expectations, exhibiting notable domain bias and a tendency to inappropriately employ automated strategies by attempting to substitute multi-step reasoning processes with single-step solutions. A particularly noteworthy observation has emerged: within the chain-of-thought (CoT) paradigm, the provision of natural language problem-solving approaches paradoxically diminishes the success rate of proof completion.