I remember there being a lot of skepticism about the performance metrics used in that paper last time this topic came up. Vaguely, I think they were comparing the runtime of Wolfram's perfect answer search to their own model which was not required to get the answer exactly right. I think there were also some other shenanigans.
Since the paper was first posted, it has been reviewed at ICLR and significantly expanded, and many of the questions raised in September have been addressed.
The performance metric is the number of equations/integrals correctly solved over a randomly generated, held-out test set. Checking is done with an external tool (SymPy), which some might consider "cheating" (but you need a way to test, don't you?). Anyway, keep in mind that checking a solution for correctness is much easier than finding one.
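The checking step is easy to reproduce with SymPy. A minimal sketch (the helper name `check_integral` is mine, not from the paper): differentiate the candidate answer and test whether the difference with the integrand simplifies to zero.

```python
import sympy as sp

x = sp.symbols('x')

def check_integral(f, F):
    """Verify a candidate antiderivative F of f by differentiating F
    and checking that the difference with f simplifies to zero."""
    return sp.simplify(sp.diff(F, x) - f) == 0

# A correct guess passes the check...
print(check_integral(sp.cos(x)**2, x/2 + sp.sin(2*x)/4))  # True
# ...and a wrong one is rejected.
print(check_integral(sp.cos(x)**2, sp.sin(x)))  # False
```

Note this check is safe in one direction only: if `simplify` fails to reduce an equivalent difference to zero, a correct solution may be rejected, but an incorrect one is never accepted.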
Another issue was about the timeout used for Wolfram. It is now discussed in the appendix (it makes very little difference).
During review, an interesting point was discussed: out of distribution generalization, or how performance depends on the random problem generator we used. This is now measured and discussed at the end of the appendix.
>During review, an interesting point was discussed: out of distribution generalization, or how performance depends on the random problem generator we used. This is now measured and discussed at the end of the appendix.
Your appendix E poses a really interesting question that I hope future research addresses. The distribution of integrals that appears on math tests is likely to closely match FWD, BWD and IBP, because that's how professors write problems for students. However, the distribution of integrals that arises in physics and engineering is quite different, and includes many that aren't solvable in closed form. I wonder what the performance of the system would be on the "engineering distribution." I also wonder what you would have to do to find out what the engineering distribution was...
Textbook problems are usually short, with short solutions, and demonstrate one specific rule. They are better handled by classical (rule-based) tools; deep learning tools would either memorize them or resort to a rule-based sub-module.
For integrals, solvability depends on the function set you work with. Since we use elementary functions on the real domain, a lot of integrals have no solution. We could have gone for a larger set (adding erf, the Fresnels, up to Liouvillian functions). This would mean more solvable cases.
As for the engineering distribution, no one knows what it is. The best we can do is to generate a broader training set, knowing that it will generalize better (this is the key takeaway of our appendix). BWD+IBP is a step in this direction, but to progress further we need a better understanding of the problem space, and of issues related to simplification. We are working on this now.
>Since we use elementary functions on the real domain, a lot of integrals have no solution. We could have gone for a larger set (adding erf, the Fresnels, up to Liouvillian functions). This would mean more solvable cases.
Here is something to think about if you want more solvable cases: even in the case of sine and cosine, solving a differential equation really means reducing it to a combination of the solutions to simpler differential equations. The sine function can be defined as the solution to a particular differential equation, as can all of the fancier functions. So in a sense it's kind of like factorization, where you have "prime" equations whose only solution is a transcendental function defined as being their solution, and "composite" equations whose solutions can be written as a combination of solutions to "prime" equations. So really all of the rare functions belong to the same general scheme.
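The "prime equation" picture can be made concrete (my formulation, not the commenter's): each transcendental function is pinned down by a simple defining equation, and the rarer functions differ only in which equation defines them.

```latex
% sin is "the" solution of one prime equation:
y'' + y = 0, \quad y(0) = 0,\; y'(0) = 1
  \quad\Longrightarrow\quad y = \sin x
% erf is defined the same way, just from a different equation:
y' = \frac{2}{\sqrt{\pi}}\, e^{-x^{2}}, \quad y(0) = 0
  \quad\Longrightarrow\quad y = \operatorname{erf}(x)
```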
For training, you need a generator because you want millions of solved examples for deep learning to work.
At test time, you usually want a test set from the same distribution as the training data (or at least related to it in some controllable way), or it becomes very difficult to interpret the results.
Suppose my test set comes from a different and unknown distribution (real problems sampled in some way).
If I get good results, is it because the training worked, or because the test set was "comparatively easy"?
If I get bad results, is it because the model did not learn, or because the test set was too far away from the training examples?
Their model is also required to get the answer exactly right. It's just that the problem domains they chose -- symbolic integration and differential equations -- can be checked easily. That is, if I have a guess F for the integral of f, I just need to symbolically differentiate F and see whether I get f.*
So the paper does "guess and check," and only returns answers that pass the check, i.e. are exactly right.
There are some problems where all the checks fail, i.e. the system can't produce an answer. And there are also problems where Matlab / Maple can't find an answer either.
* It's a little more complicated; e.g. if f = cos^2, and the derivative of F is 1 - sin^2, I need to recognize they're the same. But it's safe, in the sense that if I can't determine whether they're the same, I can reject the solution and move on to the next one.
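The footnote's point is visible directly in SymPy (a sketch): syntactic comparison of two equivalent forms fails, so the checker has to simplify the difference instead.

```python
import sympy as sp

x = sp.symbols('x')
f = sp.cos(x)**2        # the integrand
dF = 1 - sp.sin(x)**2   # derivative of a candidate answer, in another form

# Structural (syntactic) comparison does not see the equivalence...
print(dF == f)                # False
# ...but simplifying the difference does.
print(sp.simplify(dF - f))    # 0
```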
Yes, but symbolic differentiation is straightforward. There are simple rules that suffice to differentiate any expression: the chain rule, the product rule, the quotient rule, the derivatives of known single-argument functions, etc.
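That mechanical character is exactly what computer algebra systems exploit. In SymPy, for instance, arbitrary composites are differentiated by recursive rule application (a sketch):

```python
import sympy as sp

x = sp.symbols('x')

# Product rule, applied mechanically:
print(sp.diff(sp.sin(x) * sp.exp(x), x))

# Chain + product + quotient rules compose the same way,
# no matter how messy the expression gets:
expr = sp.sin(x**2) * sp.exp(x) / (1 + x**2)
print(sp.diff(expr, x))
```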
Integration is a completely different ball game. Given an arbitrary expression, most likely it's not symbolically integrable, i.e. its integral can't be expressed as a combination of known functions. That's why there are lots of functions defined as "the integral of blah."
Even given an expression that's known to be symbolically integrable, it's quite the puzzle to figure out what the integral is. There's no comparably straightforward algorithm (the Risch procedure exists in theory, but it's complex and only partially implemented in practice). It boils down to "try integration by parts; try this; try that; maybe do some clever algebraic manipulation to get it into a form where you can try one of those things; look it up in a book of integrals to see if it's there, or something similar is there; if all else fails, guess and check."
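SymPy (the same tool the paper uses for checking) illustrates both failure modes in a few lines:

```python
import sympy as sp

x = sp.symbols('x')

# Many innocent-looking integrands have no elementary antiderivative;
# the answer is a special function defined as "the integral of blah":
print(sp.integrate(sp.exp(-x**2), x))   # sqrt(pi)*erf(x)/2
print(sp.integrate(sp.sin(x)/x, x))     # Si(x)

# And sometimes the system just gives up, returning the integral unevaluated:
print(sp.integrate(x**x, x))            # Integral(x**x, x)
```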
I was a critic when this was first posted (and was promptly downvoted). I don't understand why this is getting traction (or why it was even a research direction worth pursuing). Does anyone really believe this thing is doing some kind of symbolic reasoning? It's doing nothing but memorizing training examples! That it has nonzero performance at test time is owing to the wonky metric.
I am not one who hypes AI and ML, but I don't see the problem with this.
If this new system can integrate some expressions that other methods cannot, I welcome it. Checking the solutions is almost instantaneous, so what is the problem?
I imagine the devs at Wolfram Mathematica are already working on adding this new kind of solver to their overall integrator.
On one hand you have 1980s symbolic AI that can produce true, coherent statements, and on the other you have GPT-2 which can produce really nifty gibberish. If you could put the two together maybe that would be some progress. I can see that getting transformers to emulate symbolic math might be a step in that direction.
that's not reasoning though. it's not even an expert system or prolog (because adversarial examples exist). it's very close to a million monkeys typing.
Raises a tough question about mathematics generally: is math anything other than memorizing a set of rules (and being able to regurgitate them when shown a problem)? Obviously anything like arithmetic, multiplication, etc. is just memorized rules.
Then for more complex math, can't it be argued that it's just recombination of memorized theorems and algorithms, via some kind of random search?
I'm not sure math involves as much symbolic reasoning as we like to think, but I could be wrong. The same goes for programming - mostly just regurgitating known algos (which break into smaller memorized sub-algos). When presented with something novel, generally just use random approaches until one fits. A few people in history who had to come up with "the first" such programs had some novelty, but even those were mostly just applications of known / memorized solutions in a random way (e.g. quicksort).
For at least a hundred years (as long as symbolic logic has existed, however long that really is) it has been widely known that you can iterate through every single theorem without any special creativity or thinking. The problem with that strategy is that you will spend most of your time on theorems like (not not X => X), (not not not not X => X), (not not not not not not X => X) and so on. What mathematicians actually do is use their human intuition and aesthetic sense to pilot the mechanical formalism towards goals that are worth achieving.
That’s not a really fair way of putting it. Unless you’re suggesting that there’s something deeply magical about our human “aesthetic” intuition why can’t that itself be coded into a set of syntactic indicators (e.g., favor short equations, don’t repeat yourself, connecting theorems otherwise far from each other in some distance metric is valuable) to be used as a search heuristic?
The human aesthetic can't practically be programmed as a classical algorithm. The human ability to identify cats can't even be programmed as a classical algorithm, that's why neural networks were invented. Maybe a neural network could predict how interesting mathematicians would find a theorem but by that point there's no point for anybody to do anything, even artists.
differential equations at this level (i.e. ODEs that have tabulated solutions, or are one transformation away from such an ODE) are definitely a set of rules that you memorize. differential equations as an active area of research are much more abstract, dealing with things like existence/uniqueness of solutions (when Picard-Lindelöf doesn't apply), or stability when a solution is known to exist but cannot be written down analytically (and so has to be integrated numerically). hell, one of the Clay problems is about solutions to a PDE.
so no, it's not this straightforward except in basically toy examples.
Clearly, if your equation generator is limited, the system can learn the dictionary lhs (integral) => rhs (derivative), then invert it, going rhs => lhs in order to integrate.
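The backward (BWD) trick behind that dictionary is easy to sketch: sample a random expression F, differentiate it, and you get a guaranteed-solvable training pair (F', F). A toy version (my generator, far cruder than the paper's):

```python
import random
import sympy as sp

x = sp.symbols('x')

def random_expr(depth=2):
    """Tiny random expression generator (toy stand-in for the paper's)."""
    if depth == 0:
        return random.choice([x, sp.Integer(random.randint(1, 5))])
    op = random.choice(['add', 'mul', 'sin', 'exp'])
    if op == 'add':
        return random_expr(depth - 1) + random_expr(depth - 1)
    if op == 'mul':
        return random_expr(depth - 1) * random_expr(depth - 1)
    if op == 'sin':
        return sp.sin(random_expr(depth - 1))
    return sp.exp(random_expr(depth - 1))

# BWD: differentiation is mechanical, so every sampled F yields a
# solved example (integrand f -> antiderivative F) for free.
random.seed(0)
F = random_expr()
f = sp.diff(F, x)
print(f, '->', F)
```

Because every pair is produced by differentiating, the training distribution is biased toward integrands that *look like derivatives*, which is exactly the distribution concern raised elsewhere in this thread.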
arXiv paper: https://arxiv.org/pdf/1912.01412.pdf