youoy 2 days ago

From the abstract:

> Proof assistants like Lean have revolutionized mathematical proof verification...

> Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.

Please don't use "revolutionised" or "promising" in a scientific abstract... If it has revolutionised something, tell me instead what the important thing was that made it happen. If it is promising, tell me instead why.

Apart from that, the second sentence says:

> Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data.

Is it really? Or is it hindered by an unsuitable model architecture? Because you could perfectly well say that Lean is an AI capable of theorem proving... I feel that sentence should be the potential conclusion of the paper, not the first observation. Something like: "we used more training data and it works better; if we scale theorem-proving training data, we can get the performance that we want, so training data is actually hindering the advancement of formal theorem proving in LLMs".

  • versteegen a day ago

    > their advancement in formal theorem proving is hindered by a lack of training data

    'Formal' here means inputs to a proof verifier/proof assistant, and I think most people would accept it as a given that, by the extremely data-hungry standards of LLMs, there is a severe lack of such data, and that this is a severe hindrance to getting an LLM to generalise. The question is how they might scale with more training data.

    • Loic a day ago

      Formally proving with a system based on probabilities (temperature, etc.) is for me an oxymoron. But I am not a mathematician.

      Edit: Ok, they generate proofs with the LLM, then they validate them with the Lean 4 environment. I am not sure if this is better than generating cooking recipes with an LLM and asking a cook if they make sense or not.

      • nrfulton a day ago

        > Formally proving with a system based on probabilities (temperature, etc.) is for me an oxymoron. But I am not a mathematician.

        In the context of automated theorem proving, proof search and proof checking could not be more different.

        Proof search is undecidable for most logics. For that reason, heuristics are commonly used to guide proof search. This has always been the case; heuristic proof search predates the use of LLMs for theorem proving by decades.

        Proof checking, on the other hand, is not just decidable but usually quite tractable and cheap. Therefore, proof checkers are typically deterministic and fast. Ideally the checker is also implemented in a small amount of relatively simple and heavily audited code.

        For a high-level description of how provers are typically architected, see https://www.pls-lab.org/en/de_Bruijn_criterion (or https://lawrencecpaulson.github.io/2022/01/05/LCF.html which is linked to at the bottom of the above page and goes into a bit more historical detail and nuance).
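
        To make the search/check split concrete, here is a minimal Lean 4 sketch (my own illustration, not taken from the paper): however the proof below was found -- by a human, a heuristic search, or an LLM -- the kernel's check of the finished term is deterministic, fast, and independent of whatever heuristics produced it.

            -- However this proof was produced (human, heuristic search, or LLM sample),
            -- the Lean kernel re-checks the finished term deterministically and cheaply.
            theorem add_comm_example (a b : Nat) : a + b = b + a :=
              Nat.add_comm a b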

        > I am not sure if this is better than generating cooking recipes with a LLM and asking a cook if they make sense or not.

        Conceptually it's not at all similar to asking a cook if a recipe is "correct". For lean proofs, "correct" has an entirely precise meaning. The "cook" (i.e., proof checker) is basically free (much less than $0.01/hr). The cook (proof checker) takes at most a few seconds to read and respond to any particular recipe. And the cook (i.e., checker) never makes mistakes.

        The relationship between asking a cook if a recipe is good and asking Lean to check a proof is approximately the relationship between asking a mediocre human computer to compute a derivative and asking a modern mechanical computer to do the same (using a classical algorithm, not an LLM; no one is using LLMs to check proofs -- there's really no point in doing so, because the checking problem is easy).

  • 77pt77 2 days ago

    LLMs bullshit like their "life" depends on it.

    For math that is a fatal flaw.

_flux 2 days ago

This must be one of the best applications for LLMs, as you can always automatically verify the results, or reject them otherwise, right?
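
A hypothetical sketch of that loop (not the paper's actual pipeline; ask_llm stands in for whatever model API you use, and a real setup would check files inside a proper Lean project with its dependencies):

    import subprocess
    import tempfile
    from typing import Callable, Optional

    def check_with_lean(candidate: str) -> bool:
        """Write the candidate proof to a .lean file and ask Lean to check it.
        A non-zero exit code means the checker rejected it."""
        with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
            f.write(candidate)
            path = f.name
        result = subprocess.run(["lean", path], capture_output=True, text=True)
        return result.returncode == 0

    def prove(statement: str, ask_llm: Callable[[str], str], attempts: int = 8) -> Optional[str]:
        """Sample candidate proofs from the model, keeping only one the checker accepts."""
        for _ in range(attempts):
            candidate = ask_llm(f"Write a Lean 4 proof of: {statement}")
            if check_with_lean(candidate):
                return candidate   # verified by Lean, not by the model
        return None                # every sample was rejected

The model can be as stochastic as it likes; acceptance is decided only by the checker.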

  • aithrowawaycomm 2 days ago

    The dependence on huge amounts of synthetic data makes it hard for me to see an application for mathematicians. The theorems being proved here are incredibly boring, more like calculation-heavy homework problems from a professor who is mean to freshmen. (This is not the same thing as what Terence Tao has been looking at, which is more human language -> Lean "translation.") There is a basic problem: new and interesting mathematics does not have the data for ANNs to learn from, whereas the mathematics here is very old and there are thousands of examples online. To be extremely pedantic, the authors didn't show DeepSeek was good at proofs, but rather at undergraduate competition math problems. I am assuming some of that transfers... but I am a huge critic of LLMs and even so I constantly overestimate them!

    However I could see this being useful for verified software development, which usually involves a huge amount of tedious and uninteresting lemmas, and the size of the proofs becomes exhausting. Having an LLM check all 2^8 configurations of something seems worthwhile.

  • danielmarkbruce 2 days ago

    Any situation involving the ability to simulate (like, weather models, bio models, physics models) is potentially a good application. It's still hard to make a good model to approximate an extremely complex function.

  • yuhfdr 2 days ago

    Same with code generation.

    But generating useless code, or proofs, just to discard them is hardly a consequence- and externality-free effort.

    • kkzz99 2 days ago

      But you can far more easily scale LLM generation compared to human researchers.

      • yuhfdr 2 days ago

        Sloppyjoes always assume the best case for LLMs and the worst case for their counterexample of choice.

        10 trillion llms, powered by a Dyson sphere, that still output slop is still worse than one unappreciated post doc.

        • bluechair 2 days ago

          That was funny even if I don’t agree with the sentiment.

aabhay 2 days ago

The ability to use automatic verification + synthetic data is basically common knowledge among practitioners. But all these organizations have also endlessly explored the different ways to overfit on such data, and the conclusion is the same -- the current model architecture seems to plateau when it comes to multi-step logical reasoning. You either drift too far from your common-knowledge pre-training or you never come up with the right steps in instances where there's a vast design space.

Think -- why has nobody been able to make an LLM play Go better than AlphaZero while still retaining language capabilities? It certainly would have orders of magnitude more parameters.

  • llm_trw 2 days ago

    Transformers are fundamentally linear; proofs are either tree-like or DAG-like. You'd need a new attention mechanism to allow hierarchical reasoning.

    • disconcision 2 days ago

      not obviously the case. natural language also has tree/DAG structure, yet transformers do okay with it. encoding higher-order structure in linear sequences seems necessary for transformers to be able to do much of anything

      • llm_trw 2 days ago

        Not to arbitrary depth. Empirically, the majority of text has depth less than 6 when represented as a tree. Proofs, on the other hand, have no a priori limit to their depth.

        • looofooo0 a day ago

          But isn't math structured in the same way, subdivided into lemmas, theorems, and corollaries with limited depth?

          • llm_trw a day ago

            No. It's pretty common to have depth 20+ on even relatively simple proofs when you start looking at decomposing proofs to their axioms.

            The old joke about 1 + 1 = 2 being on page 360 of Principia Mathematica still largely holds.

            • auggierose a day ago

              That depends on the granularity of your proof steps. That old joke hasn't held for a very long time now (take a look at Isabelle/Isar proofs), yet people keep bringing it up.

              • llm_trw a day ago

                It doesn't matter what the granularity of your proof step is; you still need to understand what that proof step does to be a mathematician. You seem to mistake syntactic brevity for semantic meaning.

                • auggierose a day ago

                  Mathematicians do very large proof steps, and so can the computer these days. If you do something like (proof by auto: thm_1 thm_2 thm_3) and it goes through, then I know why the proof works (because of these theorems), and this is similar to how a mathematician understands something. Syntactic brevity is indeed an indicator that you understand the proof. The better you understand something, the more concise you can formulate and prove it.
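
                  A rough Lean analogue of that kind of coarse step (a toy example of mine, not from Isabelle): the single tactic call names the fact it relies on, and that citation is essentially the explanation of why the step works.

                      -- One coarse step that cites the lemma it depends on; if it goes
                      -- through, the cited fact is the "why" of the proof.
                      example (a b c : Nat) : (a + b) * c = a * c + b * c := by
                        simp [Nat.add_mul]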

  • danielmarkbruce 2 days ago

    AlphaZero is a system including models and search capabilities. This isn't a great example.

    • michaelnny a day ago

      One important aspect of the success of AlphaGo and its successors is that the game environment is a closed domain with a stable reward function. With this we can guide the agent to do MCTS search and planning for the best move in every state.

      However, such a reward system is not available for an LLM in an open-domain setting.

    • aabhay 2 days ago

      AlphaZero is not an LLM though? It's primarily a convolutional network with some additional fully connected layers that guide an MCTS at inference time.

      • danielmarkbruce 2 days ago

        That too! It's definitely not an LLM. It would be a bad architecture choice (almost certainly...). For math an LLM would be a good choice: arbitrary-length input, sequential data, unclear where/which symbols to pay most attention to. Math notation is a human-built language just like English.

        Search is a great tool for AI, but you have to have a reasonable search space (like chess or go or poker). I'm not close enough to Lean to understand if it can do something like that (i believe not), but math doesn't feel like a thing where you can reasonably "search" next steps, for most problems.

        • danenania 2 days ago

          > It would be a bad architecture choice (almost certainly...)

          Naively, it would seem like transformers could line up nicely with turn-based games. Instead of mapping tokens to language as in an LLM, they could map to valid moves given the current game state. And then instead of optimizing the next token for linguistic coherence as LLMs do, you optimize for winning the game.

          • danielmarkbruce a day ago

            A lot of the games usually used are Markov. All the state is right there in front of you; it doesn't matter how you got there. As an example, chess: it matters not how you got to state X (...someone will bring up the edge cases...).

uptownfunk 2 days ago

Solutions to major mathematics problems generally require the creation of entirely new objects and the appropriate machinery to analyze them. The behavior of said objects generally ends up providing the solution to the original question as a special case. A new architecture is required but one that is probably not far off from the original transformer in nature.

whyowhy3484939 2 days ago

"Suppose you try to construct a coherent, ordered, natural world with no resource other than repeated exposure to things, and the formation of certain associative bonds. Oh, please!"

This is prof. Robinson on Kantian philosophy - check out Oxford podcasts by the way - and this quote is meant to imply that building a coherent world out of raw sensory data and statistics alone is completely and utterly impractical if not outright impossible. While I don't think he meant to refer to any kind of AI, in my mind this description also aptly describes the general method of DL neural networks. Repeated exposure to find correlation.

How does one find order through associativity alone? With AI this is not an academic problem anymore. This has become practical. Kant says it is impossible, not just unlikely.

The Kantian project and the various core issues it tries to address seem readily applicable to AI research, yet I see very little mention of it. Perhaps I am just dumb though. Building a mind capable of taming tremendous sensory flux needs to, at the very least, take note of the (many) fundamental issues he raised. Issues I feel are not at all trivial to set aside. I feel we are stuck in Hume's empiricist reasoning and have yet to graduate to Kant and beyond.

Are we now somehow convinced yet again that causality and reasoning will, in fact, after all spontaneously emerge out of pure chaos? Didn't we settle the impossibility of this a few hundred years ago?

  • viraptor 2 days ago

    The philosophy angle is interesting of course, but are any of those claims proven true? Why would someone stop trying to achieve something just because Kant's view of the world says it's impossible? Philosophies come and go and get refined over time. Meanwhile you only need to find one edge case where they don't apply the way Kant imagined it. Or find an area where the claim is moot in practice because you achieved all your goals anyway.

    • llm_trw 2 days ago

      It's not so much that we should stop looking at problems because Kant said it's impossible. Rather that we should read what previous generations thought about this because for the first time in history we have a practical way to scientifically test a wide selection of the philosophy of mind.

    • whyowhy3484939 2 days ago

      I can appreciate this very practical stance and I naturally urge all my technical colleagues to persist in their struggle for pragmatic victory, but I can't help but voice some concerns that at the very least may lead to an illuminating response capable of at long last disabusing me of my critical notions. You may imagine similar concerns would perhaps arise in you if massive societal resources were to be invested in finding the "next biggest number" because a multitude of people have decided math can't be trusted or some loophole is thought to be found through empirical effort alone.

      Hume's reasoning on this particular issue, and I am taking liberties here, boils down to the idea that anything can cause anything and there is no necessary connection between anything. At least, no connection we would be able to gather with our senses. The causal, necessary, connection between one billiard ball causing a second ball to move is not to be found anywhere in raw sensory data itself. You will not find a "third element", the "causal relationship", anywhere. There is just raw sensory data: one ball coming from the left, two balls beside each other, and then one ball moving away to the right. The idea that one ball caused the other to move is made up. It is fiction. It is, at best, a habit of the mind to find those sorts of correlations and label them as "causal". I dare you to find a flaw in that argument. As convincing as it is, it is pretty damning for any enterprise that wants to call itself scientific or even rational. Nothing we will experience, nothing we will ever think up, no matter how sophisticated, will, on a fundamental level, ever amount to anything more than "more or less probable".

      This famously awoke Kant from his "dogmatic slumber". Luckily for him he found some problems in Hume's argument, and again I am taking liberties, because to entertain even the idea of an external world filled with objects like billiard balls presupposes the existence of tiny, slightly important things like, oh I don't know, time and space itself. Hume, where do you pull these from? You can look in raw sensory data for evidence of time and space for a long time and, like looking for causality, you'll come up empty-handed, unless, and here is the point, you bring those notions with you and "wear those glasses", so to speak. You massage the data so it will fit the spatio-temporal domain and now you can start making sense of it and not a figurative second sooner.

      There are all sorts of parallels here with problems in AI (IMO). Neural networks are asked to infer concepts like time, space and causality by just looking at a lot of data and I can't help but be skeptical of success. The interesting thing to me here is that AI has made these dry and academic philosophical debates practical and useful. Hume talks about billiard balls, but it is easy to convert this into ML lingo by considering, say, some excitation of an array of artificial neurons that is followed by another configuration of excitation. What is their connection? How will you ever unearth "causality" from these completely unconnected events? Nothing about this problem has changed its nature at all in the past few hundred years.

      If "causality" or "necessary connection" is too abstract for your taste, consider that to, say, have any type of memory mechanism at all you have to have some sort of sensory apparatus - say, a neuron or some multitude of them - that is capable of holding, say, event A and some unit of time later, event B and can then connect those two by assigning a probability of some kind between them. Is there any other way? Can you build memory without using a mechanism vaguely of this kind? But notice you are bringing the notion of the temporal to the data instead of the other way around. Nothing about event A or event B can tell you what the nature of time is. You bring it inside your sensor which has a "before" and "after" slot. Kant would say "Aha! There it is. You could not find anything in the data so you had to preprocess it in order to make it intelligible", but he would do it in dense, long-winded, inscrutable German. (He'd probably make fun of you without you knowing it as well.)

      It is through the nature of this, in our case temporal, sensor that any kind of temporal connection can be made, not through the data itself. That is quite something and I am having a hard time refuting this line of reasoning. If you need more than space, time and causality you can consider the problem of "substance": how will you keep track of an object that alters its appearance? How do you "know" that some entity is merely changing appearance by, say, changing clothes or moving through a dark spot and is thus dimly lit all of a sudden, but is "essentially" the same? What's this "essentially"? How much of a sensory impression can change before it is a "different entity"? This problem has the same character as the temporal and causal problem. The data itself will not be illuminating unless you bring "substance glasses" with you.

      Strong AI might be found implementing Kantian category sensors like Unity, Plurality, Causality, Substance, etc. A guy can dream right.

      • mrybczyn 2 days ago

        A fantastic view on the nature of inference and the physical world! What I appreciate most about our LLM adventures in the past 5 years is that it's finally:

        1. Plausible to discuss, and to find hints of, what "AI" might mean ("AGI" or "Strong AI").

        2. Possible to dig up all those crusty and dusty notions of philosophers long dead, and some nerdy reads from my teenage years, and to try to apply them to an interesting system to study empirically, instead of just humanistic pondering.

        3. Same with psychology! Now that our pattern-matching systems are starting to pass various Turing tests, what does that mean for "sentience" and "sapience", and should we all become armchair psychologists as we teach our systems to understand and act in the world? Recall the first part of 2010: Odyssey Two, where the AI researcher is more of a child psychologist...

        Lastly - LLMs and "glimpses of AI" are something new! Not the same old recycled tech and pop culture. Truly something new under the sun. Good times.

      • ahkpasp 2 days ago

        I think today's AI researchers' preoccupation with just learning directly from data in an effort to achieve AGI is understandable, because this data-driven approach has generated results that are far more impressive than all the previous approaches that involved humans implementing concepts / reverse-engineering how the human mind works in order to create a truly intelligent machine (see Sutton's Bitter Lesson http://www.incompleteideas.net/IncIdeas/BitterLesson.html). Because of the demonstrated success of these systems, a popular hypothesis is that maybe increasingly large amounts of data and computation is enough to solve all the remaining problems in AI.

        Can abstract concepts such as causality be "learned" directly from data? It's an important question, but answering such a question might not be necessary to train an AI model to make good predictions. It might be enough to show an AI model videos of billiard games, then to pause at random points in those videos and ask the model to predict what would happen next. (For example, the video could be paused at the very moment one billiard ball hit a second billiard ball.) If the AI model manages to correctly predict what the very next frame looks like (the second billiard ball starts to move), and it produces similar predictions in analogous situations of things colliding with other things, then it arguably "understands" this form of causality. If it fails to predict the next frame, the neural network training algorithm can adjust the neural network's neuron weights such that it will be inclined to make the correct prediction in the future. Over the course of many of these predictions and adjustments, over thousands or millions of hours of video, the expectation is that a system will emerge that is capable of making correct predictions in a variety of situations, as that means it has managed to develop general rules or "understand" causality.
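
        As a deliberately trivial toy version of that setup (only meant to show the shape of the argument, not a real vision model): the "world" below is a ball moving at a constant velocity, and a one-parameter model recovers the rule purely from next-step prediction error.

            # Toy "predict the next frame" loop: the world is a ball moving at a
            # constant (hidden) velocity; the model's only parameter is its guess.
            positions = [0.0]
            true_velocity = 3.0
            for _ in range(20):
                positions.append(positions[-1] + true_velocity)

            guess = 0.0   # the model's single weight
            lr = 0.1
            for _ in range(100):
                for t in range(len(positions) - 1):
                    error = (positions[t] + guess) - positions[t + 1]
                    guess -= lr * error   # adjust the weight to reduce prediction error

            print(round(guess, 3))  # ~3.0: the "rule" recovered from prediction alone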

        As the AI developer I wouldn't even have to be aware of e.g. why a billiard ball could cause another to move, because if my neural network can make accurate predictions in such situations, it's useful and demonstrates understanding. Such an understanding may in fact require the neural network to internally represent such concepts as causality / laws of physics (as your discussion implies), but I would not have to concern myself with the details of how the neural network actually perceives the world, if the predictions are consistent and reliable. With large amounts of data and computation, the neural network weights could be tuned to create a network that is able to make good predictions in any domain at all. That's the idea anyway.

        Despite the optimism I think that there's something missing. There are simpler problem domains, like SimpleLogic (https://arxiv.org/abs/2205.11502), where researchers show that learning logical reasoning from data is apparently impossible in practice, because various spurious statistical correlations in data strongly encourage ML systems to learn incorrect approaches to solving logic problems, sort of like how using incorrect methods can sometimes still give you the correct answer in maths problems. The problem here is that the incorrect method might give valid answers in some situations, but will absolutely give wrong answers in others. I think that this issue is a severe and possibly insurmountable flaw of the current approach to AI, despite the impressive results we've seen so far, because real-world data such as video, images, and text, is far messier than artificial examples used in this research paper. But who knows what the future will bring?

        • whyowhy3484939 a day ago

          The current generation's interest is indeed quite understandably focused on, pardon the simplification, data-driven correlation engines, and the results have been nothing but spectacular. Let it be clear that I do not reject the idea(s) behind the bitter lesson. On the contrary, I think it is spot on, yet I feel there is in a great many instances a false, mostly unspoken, dichotomy being imagined behind those words: there is either A) raw probabilistic association lacking any and all structure, or B) full-scale good old-fashioned symbolic AI containing nothing but highly specific, human, knowledge.

          The text of the bitter lesson itself provides an interesting case, on the topic of computer vision: "Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better."

          To me this proves the necessity of proper measurement devices - in this case "convolution" and "certain kinds of invariances" - to render data as intelligible as possible using as few moving parts as required. What I don't see in this is: "we need to drop all preexisting notions of structure and just focus on finding pure correlation only". The fact that we have moved to more fundamental search operations - from "cylinder" to "convolution" - does not IMO fundamentally alter the general strategy. In short, I think the lesson is being stretched beyond its limits.

          The lesson shows us: "We want AI agents that can discover like we can, not which contain what we have discovered" and I couldn't agree more, yet we seem to disagree on what it means to "discover like we can". I happen to believe that raw correlation is just one of the key ingredients. In fact Kant has the notion of "community" that, and I might be sorely mistaken here, appears to capture the associative power of cognition in much the same way.

          I, of course, lack any sort of credentials to further my point. I only imagine it might possibly be of slight interest to someone to squint upon this topic from a skewed angle.

          PS: Your point on pragmatism is well taken. I'm sure we are all quite excited to see what happens over the next couple of years, yet I cannot help but hear an ominous violin crescendo beneath your "if the predictions are consistent and reliable".

          • QuesnayJr a day ago

            There has been considerable progress since Sutton wrote. The latest models don't even need convolutions and certain kinds of invariances. Vision Transformer models don't even necessarily know which pixels are near each other. They figure it out through correlations.

            I was skeptical of the statistical approach myself, but I think it's time for some humility in the face of its wild success.

  • ilaksh 2 days ago

    Science came along and made much of philosophy obsolete, especially the parts that are now directly addressed by science and engineering projects.

    For your comment to make sense it would have to include some context of computer science and artificial intelligence research, a significant amount of which is directly applicable to the topic you mention.

    It's not that you are dumb, you are just being willfully ignorant. I'm not an expert on machine learning but I do know enough to say confidently that there is a lot of research about building world models from sensory data.

    • ahkpasp 2 days ago

      > Science came along and made much of philosophy obsolete, especially the parts that are now directly addressed by science and engineering projects.

      Science (natural sciences) was originally known as "natural philosophy", it was just a branch of philosophy that happened to be successful enough to stop being called philosophy. So arguably there's still a huge amount of value in the philosophical ideas and attitudes that led to such important discoveries.

    • whyowhy3484939 2 days ago

      Oh, I agree on the ignorant part, but you can believe me it is not willed. I very much like to be educated on this because nothing I have read - yet - touched these fundamental problems with a ten-foot pole.

      I have elaborated in a sibling comment. I do not dare to ask you to consider it because it is quite loquacious, but in the off-chance you want to indulge in some armchair philosophy you can join me in my pedantry and enlighten a lost soul.

  • youoy 2 days ago

    Pearl's ladder of causation (in maths and AI research) already tells you that causation cannot be inferred from data alone. If you only have data, then you cannot generally go beyond correlations. This can be solved either by humans hardcoding their knowledge into a model that extracts causation based on that knowledge, or by giving a machine the ability to perform its own experiments. This is because there are some correlations (A - B) that are generated by a causation (A -> B) whose arrow direction cannot be decided from data alone.
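
    A small numerical illustration of that last point (my own example in the spirit of Pearl, not taken from him): two binary models with opposite arrows can produce exactly the same observational joint distribution, so only an intervention can tell them apart.

        import itertools

        # Model 1: A -> B
        pA = {0: 0.5, 1: 0.5}
        pB_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}

        def joint_forward(a, b):
            return pA[a] * pB_given_A[a][b]

        # Model 2: B -> A, with its parameters fitted to the same joint distribution
        pB = {b: sum(joint_forward(a, b) for a in (0, 1)) for b in (0, 1)}
        pA_given_B = {b: {a: joint_forward(a, b) / pB[b] for a in (0, 1)} for b in (0, 1)}

        def joint_backward(a, b):
            return pB[b] * pA_given_B[b][a]

        # Identical observational distributions, opposite causal arrows.
        for a, b in itertools.product((0, 1), repeat=2):
            assert abs(joint_forward(a, b) - joint_backward(a, b)) < 1e-12
        # Only an intervention such as do(A := 1) would distinguish the two models.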

    • whyowhy3484939 a day ago

      That's an important observation by Professor Pearl, and I don't wish to belittle his contributions, which are significant, but I do wish to stress that it is the 18th century we need to thank for "causation cannot be inferred from data alone", because I feel we are, to our detriment, unfairly dismissive of past thinkers.

      • youoy a day ago

        I don't think we are being dismissive, it's just that a mathematical theory is more convincing than a philosophical one in terms of what can and cannot be achieved with data.

        • whyowhy3484939 a day ago

          Fair enough. I can't say I'm familiar enough with professor Pearl's work to argue productively, but I do thank you for pointing this out. This is most interesting.

  • baanist 2 days ago

    I doubt you are going to change anyone's mind on this who is already convinced that computers can think and reason and all that's required is the right sequence of numbers. Moreover, the internet is awash with bots and AIs working on behalf of governments to spread political and economic propaganda. HN has moderation to avoid it but the moderators are human so they can't foil all attempts and notice all AIs and bots.

Havoc 2 days ago

Surprised by the negativity here. A 7B-class model roughly doubles GPT-4's score and everyone goes "meh"?!?

  • lmeyerov 2 days ago

    I was excited by that headline, but I didn't get a clear & fair sense of the comparison, like how prompt-engineered the baseline was, or whether it was a false comparison.

    There was not enough detail to determine what was ultimately useful & truly better, if anything, so the lessons learned are near useless. Likewise, I could not tell how useful the fine-tuning was and why, vs other basic tricks that would avoid all this complexity. The work seems good, but I found almost no scientific value in the experimentation and reporting, so I can't comment the way I normally would because there is little to comment on. We focus more on the coding & analysis side, logical QA on fuzzier questions, so I am genuinely curious, supportive, informed, etc., but left frustrated and wanting my time back.

  • Workaccount2 2 days ago

    The attitude that HN on the whole has towards AI is what you would expect in response to an existential threat to a very high-paying career.

    • danpalmer 2 days ago

      I've seen broadly 2 attitudes: a refusal to accept AI as an existential threat to a high paying career, and a belief that AI is an existential threat to a high paying career.

      As with most things in life, the truth is likely somewhere boring in the middle.

      • llm_trw a day ago

        Option 3: anyone who can add AI to their career will be even more highly paid because of the productivity boost.

  • rightbyte 19 hours ago

    "Deepseek" gets very little hype here for some reason. There is this Deepseek Coder llm model, that was comparable to others at the time and could run locally. Like, zero hype.

tikkun 2 days ago

[Submitted on 23 May 2024]