Leanstral 1.5: Evidence abundance for everyone

by OmarAli

Since its inception, Leanstral has offered an open, practical approach to proof engineering in Lean 4. Today we are releasing Leanstral 1.5, a free, Apache 2.0-licensed model with 119B total parameters and only 6B active, delivering a performance boost that makes formal verification more powerful and accessible than ever before.

Leanstral 1.5 saturates miniF2F, solves 587 of 672 PutnamBench problems, and reaches a new state of the art of 87% on FATE-H and 34% on FATE-X. Beyond benchmarks, it checks complex code properties and uncovers previously unknown bugs in open source repositories, showing that rigorous formal methods can be both effective and practical for real-world use.

Leanstral training

Leanstral 1.5 is trained in three stages: mid-training, supervised fine-tuning, and reinforcement learning with CISPO. The reinforcement learning stage relies on extensive training in two RL environments:

In the multi-turn environment, the model receives a theorem statement and must either prove or disprove it. The model submits a proof, receives feedback from the Lean compiler, and refines its approach with each attempt. If the proof is accepted, the episode ends; otherwise, the loop continues until the model either solves the problem or exhausts its budget.
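The submit, check, refine loop described above can be sketched in a few lines. This is a minimal illustration, not the actual training environment: `propose` and `check` are hypothetical callbacks standing in for the model and the Lean compiler, and the budget is modeled as a simple attempt count, which is an assumption (the text does not say how the budget is measured).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    ok: bool        # True if the checker accepted the proof
    feedback: str   # compiler messages shown to the model on failure


def multiturn_episode(
    statement: str,
    propose: Callable[[str, list[str]], str],    # hypothetical model call
    check: Callable[[str, str], CheckResult],    # hypothetical compiler call
    max_attempts: int,                           # assumed budget: attempt count
) -> Optional[str]:
    """Return the first accepted proof, or None if the budget runs out."""
    history: list[str] = []
    for _ in range(max_attempts):
        proof = propose(statement, history)
        result = check(statement, proof)
        if result.ok:
            return proof
        # Feed the compiler output back so the next attempt can use it.
        history.append(result.feedback)
    return None


# Toy stand-ins: the "model" only finds the proof after seeing one error.
def toy_propose(statement: str, history: list[str]) -> str:
    return "rfl" if history else "sorry"


def toy_check(statement: str, proof: str) -> CheckResult:
    return CheckResult(ok=(proof == "rfl"), feedback="error: unsolved goals")


print(multiturn_episode("example : 2 + 2 = 4", toy_propose, toy_check, 3))
```

The disproof path is not modeled here; one way to include it would be to let `check` accept a proof of either the statement or its negation.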

In the code agent environment, Leanstral works like a developer in a raw file system: it edits files, executes bash commands, and uses the Lean language server to check goals, errors, and type information in real time. This allows it to handle long-horizon tasks such as completing partial proofs in a repository, creating auxiliary lemmas, and persisting across multiple rounds of context compaction. The model learns to navigate the entire proof engineering workflow, and its final output is verified for correctness by our SafeVerify fork against a list of target theorems.

Evaluation

We evaluate Leanstral using the following benchmarks:

  • miniF2F is a cross-system benchmark for formal mathematics, ranging from elementary problems to IMO-level challenges, testing various proof skills in algebra, combinatorics and number theory.

  • PutnamBench consists of 672 problems from the Putnam Mathematical Competition that require deep thinking and long chains of reasoning to solve.

  • FATE-H and FATE-X are abstract algebra benchmarks of graduate-level and PhD-level problems that test advanced reasoning in areas such as group theory, ring theory, and module theory.

  • FLTEval is based on real pull requests from the Fermat’s Last Theorem repository and tests practical proof techniques with real-world complexity.

We fully saturate miniF2F, achieving 100% on both the validation and test sets. On PutnamBench and FATE-H/X, we compare Leanstral 1.5 with Goedel-Architect without natural-language guidance, Seed-Prover 1.5 in its high setting, and AxProverBase. Leanstral sets a new state of the art on FATE-H and FATE-X, solving 87 and 34 problems respectively. On PutnamBench, it outperforms Seed-Prover 1.5 High by 7 problems at a much lower cost: about $4 per problem, versus an estimated $300 or more for Seed-Prover, whose high setting runs on a budget of 10 H20 days per problem. The only higher-ranked provers operate under different conditions: some are given natural-language proofs, while others cost far more to run, such as Aleph Prover at $54-$68 per problem.
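To put the quoted costs side by side, the short calculation below works out the ratios. It uses only the figures stated above. The implied price per H20 hour is derived here by dividing the $300 lower estimate by 10 days of 24 hours, so it is an inference rather than a number from the evaluation.

```python
# Figures quoted in the comparison above (USD per problem).
leanstral_cost = 4.0             # "about $4 per problem"
seed_prover_cost = 300.0         # lower end of "$300 or more"
seed_prover_gpu_days = 10        # "10 H20 days per problem"
aleph_cost_range = (54.0, 68.0)  # "$54-$68 per problem"

# How many times more expensive each alternative is than Leanstral.
seed_ratio = seed_prover_cost / leanstral_cost
aleph_ratios = tuple(c / leanstral_cost for c in aleph_cost_range)

# Derived, not stated in the text: hourly H20 price implied by the estimate.
implied_h20_hour_price = seed_prover_cost / (seed_prover_gpu_days * 24)

print(f"Seed-Prover 1.5 High: at least {seed_ratio:.0f}x the cost per problem")
print(f"Aleph Prover: {aleph_ratios[0]:.1f}x to {aleph_ratios[1]:.1f}x")
print(f"Implied H20 price: ${implied_h20_hour_price:.2f} per hour")
```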

Leanstral 1.5 shows the strongest test-time scaling we have seen in a formal reasoning model. The following figure shows pass@8 on PutnamBench as we increase the token budget per attempt from 25,000 to 4M: performance increases smoothly and monotonically throughout, from 44 problems solved at 50,000 tokens to 244 at 200,000, 493 at 1M, and 587 at 4M. Rather than giving up when a proof takes a long time, Leanstral keeps reasoning, editing files, and reworking its approach across millions of tokens, translating budget directly into solved problems. This is the same behavior behind the AVL tree proof below, which ran across 2.7 million tokens and 22 compactions.
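For readers unfamiliar with the metric, here is a minimal sketch of how a pass@k score can be computed. It uses the unbiased estimator that is common in code generation evaluation; the post does not state which estimator was used for these results, so treat the choice as an assumption. When exactly k attempts are sampled per problem (n = k), it reduces to "a problem counts as solved if any of the k attempts succeeds".

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, which yields 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def problems_solved(correct_counts: list[int], n: int, k: int) -> float:
    """Expected number of problems solved at pass@k over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts)


# Three toy problems with 8 attempts each: 0, 1, and 8 correct attempts.
print(problems_solved([0, 1, 8], n=8, k=8))
```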

[Figure: pass@8 on PutnamBench as a function of token budget per attempt]

With this release, we are also making FLTEval fully open source. Leanstral 1.5 raises the average test score at pass@1 from 21.9 to 28.9 and at pass@8 from 31.9 to 43.2, beating Opus 4.6’s 39.6 at a seventh of the cost. It also extends its lead over open source models three to ten times larger, as shown in the figure below.

[Figure: FLTEval results for Leanstral 1.5 compared with larger open source models]

Code review case studies

Although Leanstral 1.5 is designed primarily for mathematics, it also has strong code review capabilities. We present two case studies to demonstrate its impact.

AVL trees: Proving time complexity

AVL trees are self-balancing binary search trees that maintain O(log n) height by rebalancing on insertions and deletions. Leanstral 1.5 proved these time complexity guarantees for a real implementation, a task that required structural induction mirroring the recursive structure of the tree, careful handling of monadic time tracking, and extensive case analysis of the rebalancing paths. Over 2.7 million tokens and 22 compactions, Leanstral systematically unfolded each layer of the TimeM monad, exposing the underlying computations even where they were nested within the control flow. It established a near-tight bound of 48 steps per unit of height plus a constant for insertion, and then linked height to tree size via a logarithmic relationship. This yielded a complete, verified proof that insertion and deletion are indeed O(log n).
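The final step of that argument composes two bounds. As a sketch, write T_ins(t) for the counted steps of an insertion into tree t, h(t) for its height, and n(t) for its number of nodes. This notation, and the constants a, b, and c, are introduced here for illustration; the text only states the factor 48 and the existence of an additive constant and a logarithmic height bound.

```latex
\begin{align*}
T_{\mathrm{ins}}(t) &\le 48\,h(t) + c
  && \text{(per-height step bound)} \\
h(t) &\le a \log_2\bigl(n(t) + 1\bigr) + b
  && \text{(AVL balance invariant)} \\
\Longrightarrow\quad
T_{\mathrm{ins}}(t) &\le 48a \log_2\bigl(n(t) + 1\bigr) + 48b + c
  = O\bigl(\log n(t)\bigr)
\end{align*}
```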

Bug Discovery: Finding hidden bugs

To test Leanstral’s bug-finding capabilities, we built an automated pipeline: Aeneas translates Rust code into Lean, while Leanstral infers user intent and generates correctness properties from the code. Leanstral then makes four attempts to prove each property. If all four fail, it tries to prove the negation instead, again with four attempts. Across 57 repositories tested, this process flagged 47 broken properties, 11 of which pointed to real bugs, five of them not previously reported on GitHub.
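The prove-then-negate step of this pipeline can be sketched as follows. This is an illustrative outline only: `try_prove` is a hypothetical callback standing in for one Leanstral proof attempt, and writing the negation as `Not (...)` is an assumption about how the negated property is phrased.

```python
from typing import Callable, Literal

Verdict = Literal["proved", "refuted", "unknown"]


def classify_property(
    prop: str,
    try_prove: Callable[[str], bool],  # hypothetical: one proof attempt
    attempts: int = 4,                 # four attempts, as described above
) -> Verdict:
    """Try to prove a property; if every attempt fails, try its negation."""
    # any() short-circuits, so no attempts are spent after a success.
    if any(try_prove(prop) for _ in range(attempts)):
        return "proved"
    negation = f"Not ({prop})"
    if any(try_prove(negation) for _ in range(attempts)):
        return "refuted"  # a proved negation marks the property as broken
    return "unknown"


# Toy prover that only ever succeeds on one fixed statement.
def toy_prover(statement: str) -> bool:
    return statement == "Not (decode never overflows)"


print(classify_property("decode never overflows", toy_prover))
```

A "refuted" verdict corresponds to a broken property in the text; whether it reflects a real bug or a wrongly inferred property still needs a separate review, which is consistent with 47 flagged properties yielding 11 real bugs.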

One such bug was in the sign function used for zigzag decoding in the datrs/varinteger library. When given Std.U64.MAX as input, the expression (value + 1) overflowed, resulting in crashes in debug mode and silent corruption in release mode, an edge case that would typically be missed by testing and fuzzing. Leanstral’s pipeline detected it automatically, demonstrating that formal verification can already be applied to real codebases and finds bugs that some traditional methods miss.
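The overflow can be reproduced in a small model. The sketch below is not the library's code: `decode_wrapping` is an assumed shape for a zigzag sign function that computes (value + 1), with 64-bit wraparound simulated by masking, as Rust does in release builds (debug builds abort on the overflow instead). `decode_safe` is the standard shift-and-XOR zigzag decode, which avoids the addition entirely.

```python
U64_MAX = 2**64 - 1


def decode_wrapping(value: int) -> int:
    """Assumed buggy shape: odd inputs map to -((value + 1) / 2)."""
    if value & 1:
        # Simulate u64 wraparound: U64_MAX + 1 becomes 0.
        return -(((value + 1) & U64_MAX) // 2)
    return value // 2


def decode_safe(value: int) -> int:
    """Standard zigzag decode: (value >> 1) XOR -(value & 1)."""
    return (value >> 1) ^ -(value & 1)


# The two versions agree everywhere except at the top of the u64 range.
print(decode_wrapping(3), decode_safe(3))
print(decode_wrapping(U64_MAX), decode_safe(U64_MAX))
```

Because the two functions differ on a single input out of 2^64, random testing is unlikely to hit it, whereas a proof attempt over all inputs has to account for that case.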

Get started

Leanstral 1.5 is released under the Apache 2.0 license. The weights are available on Hugging Face, and the model is also available through a free API endpoint, leanstral-1-5. We recommend using it in Mistral Vibe. To get started, get an API key and:

1. Set up Mistral Vibe

uv tool install mistral-vibe

uv tool upgrade mistral-vibe

2. Install Leanstral 1.5

3. Start the agent

4. Install Lean LSP MCP (optional)

It is highly recommended to install Lean LSP MCP by adding the following to your ~/.vibe/config.toml

If you had no MCP servers configured before, you may need to remove the existing mcp_servers = [] line.

5. Start proving

Ask Leanstral to tackle a theorem, debug a proof, or contribute to a repository. It’s that simple.

https://mistral.ai/news/leanstral-1-5/
