I recently saw this video about a model called Minerva, developed by a couple of Google teams, that appears to be much better at the kind of mathematical reasoning required for A-level maths than I’d have expected: Is Google’s New AI As Smart As A Human? 🤖 - YouTube. See this blog post too: Google AI Blog: Minerva: Solving Quantitative Reasoning Problems with Language Models, and the sample results: Minerva Explorer. It’s surprising because it doesn’t use the reasoning frameworks provided by theorem provers, nor does it even delegate the necessary calculations to a maths library like NumPy; it’s just learnt the patterns in how people write solutions to these kinds of problems.
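To make that distinction concrete, here is a toy sketch (entirely my own illustration, not anything from the paper) of the two ways a language model could answer a simple arithmetic question. The example expression and the split-based parsing are made up for demonstration:

```python
# Two ways a language model could answer "What is 12.7 * 3.4?"

# 1) Pattern imitation (roughly what Minerva does): the answer is
#    produced as text tokens, so its correctness depends entirely on
#    patterns learnt during training -- nothing checks the arithmetic.
generated_text = "12.7 * 3.4 = 43.18"  # could just as easily be wrong

# 2) Tool delegation (which Minerva does NOT do): hand the arithmetic
#    to actual numerical code (e.g. NumPy, or plain Python here) and
#    splice the exact result back into the answer.
expression = "12.7 * 3.4"
a, op, b = expression.split()
computed = float(a) * float(b)
print(f"{expression} = {computed:.2f}")
```

The second approach is guaranteed correct for the calculation itself, which is what makes Minerva's purely text-based accuracy so surprising.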

I haven’t read much of the paper yet, but I’m very curious about the limits of this approach. E.g. how well does it generalise to question styles that were not present in the training set? How many significant figures can the numbers in a question have before its success rate plummets? Interestingly, the paper says that the largest model (540 billion parameters), which achieved the best performance, is undertrained, so it would also be interesting to know how much better it would do if it were fully trained. More importantly, I’d like to see whether comparable results can ever be achieved with much smaller models: the financial cost of training such massive models puts this approach out of reach for most organisations, and the environmental cost is troubling too.