Center for Data Science, Zhejiang University - Doctoral Student Seminar 2025 [15]

Published: 2025-05-26

Speaker: 李祎哲 (Li Yizhe)
Time: May 27, 2025, 14:00
Venue: Lecture Hall 1417, Administration Building, Zijingang Campus, Zhejiang University

Paper presented: Benchmarking Formal Mathematical Reasoning of Large Language Models

Abstract: Formal mathematical reasoning remains one of the fundamental challenges in artificial intelligence. We introduce FormalMATH, a large-scale benchmark designed for formal mathematical reasoning. It comprises 5,560 mathematically rigorous statements verified by the Lean4 compiler, spanning 12 subfields including algebra, number theory, calculus, and discrete mathematics. To overcome the reliance of conventional formalized data on expert manual annotation, we develop a "three-stage filtering" framework. Evaluation on the full FormalMATH dataset shows that mainstream large language model (LLM) provers fall well short of expectations: they exhibit notable domain bias and tend to misuse automated tactics, attempting to replace multi-step reasoning with single-step solutions. One finding is particularly noteworthy: in the chain-of-thought (CoT) setting, supplying a natural-language solution approach lowers the proof success rate rather than raising it.
