Symbolic Regression: The Forgotten Machine Learning Method

Turning data into formulas can result in simple but powerful models

Published in Towards Data Science · 4 min read · Nov 17, 2020

The goal of a regression model is very simple: take as input one or more numbers and output another number. There are many ways to do that, from simple to extremely complex.

The simplest case is that of linear regression: the output is a linear combination of the input variables, with coefficients chosen to minimize some training error. In many contexts, a simple model like this will be enough, but it will fail in cases where nonlinear relationships between the variables are relevant. In the strongly nonlinear world that we live in, this happens very often.
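
As a minimal illustration (a sketch with made-up data, using NumPy), a linear model chooses its coefficients by least squares; when the true relationship is nonlinear, the fit stays poor no matter which coefficients are chosen:

# Minimal NumPy illustration: fit a linear model by least squares and compare
# its error on a linear target versus a nonlinear one (made-up data).
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=200)

def linear_fit_mse(x, y):
    # Design matrix [x, 1] -> (slope, intercept) chosen to minimize squared error
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ coef - y) ** 2)

print(linear_fit_mse(x, 2.0 * x + 1.0))    # linear ground truth: error ~ 0
print(linear_fit_mse(x, np.sin(3.0 * x)))  # nonlinear ground truth: large residual error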

On the other side of the spectrum of model complexity are black-box regressors like neural networks, which transform the input data through a series of implicit calculations before giving a result. Those models are very popular nowadays due to the promise that they will one day result in a general “artificial intelligence”, and due to their striking success in difficult problems like computer vision.

Here we want to discuss a middle ground between those two extremes that seems to not have received the attention that it deserves so far: symbolic regression.

A generalization of linear or polynomial regression is to search over the space of all possible mathematical formulas for the ones that best predict the output variable from the input variables, starting from a set of base functions such as addition, trigonometric functions, and exponentials. This is the basic idea of symbolic regression.

In a symbolic regression optimization, it is important to discard a larger formula whenever a smaller one with the same accuracy is encountered. This is necessary to avoid obviously redundant solutions like f(x) = x + 1 - 1 + 0 + 0 + 0, and also to avoid settling for a huge polynomial that fits the training data perfectly but generalizes poorly.
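
To make these ideas concrete, here is a deliberately naive sketch in Python: a random search over small expression trees built from a handful of base functions, scored by mean squared error plus a size penalty. The data and the penalty value are made up for illustration; real symbolic regression engines use far more sophisticated search strategies (genetic programming, simulated annealing), but the ingredients are the same.

# Naive random-search sketch of symbolic regression (illustrative only).
# Candidate formulas are small expression trees built from base functions;
# each candidate is scored by mean squared error plus a size penalty.
import math, random

UNARY = {"sin": math.sin, "cos": math.cos, "exp": math.exp}
BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def random_expr(depth=3):
    # Leaves are the variable x or a small constant; inner nodes are operators.
    if depth == 0 or random.random() < 0.3:
        return "x" if random.random() < 0.7 else round(random.uniform(-2, 2), 2)
    if random.random() < 0.5:
        return (random.choice(list(UNARY)), random_expr(depth - 1))
    return (random.choice(list(BINARY)), random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    if expr == "x":
        return x
    if isinstance(expr, (int, float)):
        return expr
    if len(expr) == 2:
        return UNARY[expr[0]](evaluate(expr[1], x))
    return BINARY[expr[0]](evaluate(expr[1], x), evaluate(expr[2], x))

def size(expr):
    return 1 if not isinstance(expr, tuple) else 1 + sum(size(e) for e in expr[1:])

def score(expr, xs, ys, penalty=0.01):
    # Accuracy term plus a parsimony term, so shorter formulas are preferred.
    try:
        err = sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    except OverflowError:
        return float("inf")
    return err + penalty * size(expr)

# Made-up data with hidden ground truth y = sin(x) + x.
xs = [i / 10 for i in range(-30, 31)]
ys = [math.sin(x) + x for x in xs]

best = min((random_expr() for _ in range(20000)), key=lambda e: score(e, xs, ys))
print(best, score(best, xs, ys))  # the search may or may not recover sin(x) + x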

This method was popularized in 2009 with the introduction of a desktop application called Eureqa [1], which used a genetic algorithm to search for relevant formulas. The software attracted widespread attention with the promise that it could eventually be used to derive new laws of physics from empirical data, a promise that was never quite fulfilled. In 2017 Eureqa's developer was acquired by DataRobot, and the product eventually left the market [2].

More recently, new symbolic regression tools have been developed, such as TuringBot [3], a desktop application for symbolic regression based on simulated annealing. The promise of deriving physical laws from data with symbolic regression has also been revived by a project called AI Feynman, led by the physicist Max Tegmark [4].
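
For reference, the core of a simulated annealing search is the acceptance rule sketched below. This is the textbook rule, not a description of TuringBot's internal implementation; the mutate and score arguments (randomly editing a candidate formula, and measuring its error plus complexity) are left abstract here.

# Textbook simulated annealing acceptance step (illustrative sketch).
import math, random

def anneal_step(current, current_score, mutate, score, temperature):
    candidate = mutate(current)           # randomly edit the current formula
    candidate_score = score(candidate)    # error plus complexity penalty
    delta = candidate_score - current_score
    # Always accept improvements; accept worse candidates with a probability
    # that shrinks as the temperature is lowered over the course of the search.
    if delta <= 0 or random.random() < math.exp(-delta / temperature):
        return candidate, candidate_score
    return current, current_score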

Despite the efforts to promote symbolic regression over the years, the truth is that the method has never gained mainstream popularity. In an academic context, research on hot topics like neural networks is much more tractable, since well-understood training procedures such as gradient-based optimization are available for those models. Symbolic regression is messier: its search space of formulas is discrete and combinatorial, so it often depends on ad hoc heuristics to work efficiently.

But this should not be a reason to disregard the method. Even though it is hard to generate symbolic models, they have some very desirable characteristics. For starters, a symbolic model is explicit, making it explainable and offering insight into the data. It is also simple, given that the optimization will actively try to keep the formulas as short as possible, which could potentially reduce the chances of overfitting the data. From a technical point of view, a symbolic model is very portable and can be easily implemented in any programming language, without the need for complex data structures.
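
For example, if a symbolic regression run returned a formula such as y = 0.5*x0 + sin(x1) (a hypothetical result used here only for illustration), deploying it requires no machine learning runtime at all:

# Deploying a (hypothetical) discovered formula is a one-line function.
import math

def predict(x0, x1):
    return 0.5 * x0 + math.sin(x1)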

Perhaps Eureqa’s glib promise of uncovering laws of physics with symbolic regression will never be fulfilled, but it could well be the case that many machine learning models deployed today are more complex than necessary, going to great lengths to do something that could be equivalently done by a simple mathematical formula. This is particularly true for problems in a small number of dimensions — symbolic regression is unlikely to be useful for problems like image classification, which would require enormous formulas with millions of input parameters. A shift to explicit symbolic models could bring to light many hidden patterns in the sea of datasets that we have at our disposal today.

[1] Schmidt M., Lipson H. (2009) “Distilling Free-Form Natural Laws from Experimental Data”, Science, Vol. 324, no. 5923, pp. 81–85.

[2] DataRobot Acquires Nutonian (2017)

[3] TuringBot: Symbolic Regression Software (2020)

[4] Udrescu S.-M., Tegmark M. (2020) “AI Feynman: A physics-inspired method for symbolic regression”, Science Advances, Vol. 6, no. 16, eaay2631

FAQs

What is the symbolic regression method?

Symbolic regression works by employing evolutionary algorithms, often inspired by natural selection and genetics, to search for the most suitable mathematical expressions. It iteratively optimizes these expressions based on predefined criteria, such as accuracy and complexity, to find the best representative model.

What is symbolic regression using an LLM?

Symbolic regression (SR) is a task which aims to extract the mathematical expression underlying a set of empirical observations. Transformer-based methods trained on SR datasets hold the current state of the art in this task, while the application of Large Language Models (LLMs) to SR is still largely unexplored.

How is symbolic regression implemented?

The symbolic regression problem for mathematical functions has been tackled with a variety of methods, most commonly genetic programming, which recombines candidate equations, as well as more recent approaches based on Bayesian methods and neural networks.

What Python libraries exist for symbolic regression?

SymReg is a symbolic regression library that aims to be easy to use and fast. You can use it to find expressions that explain a given output from given inputs. The expressions can use arbitrary building blocks, not just the weighted sums used in linear models.

Is symbolic regression NP-hard?

Yes. It has been proven that symbolic regression (SR), i.e., the problem of discovering an accurate model of data in the form of a mathematical expression, is NP-hard.

What is the symbolic learning method?

Symbolic learning, rooted in classical artificial intelligence, is a set of methods for learning symbolic equations from data and from numerical functions.

What is symbolic regression interpretability?

In some situations, the interpretability of a machine learning model plays a role as important as its accuracy. Because symbolic regression produces explicit formulas, its models can be inspected and explained directly; this topic has been studied in benchmarks such as “Interpretability in Symbolic Regression: a benchmark of Explanatory Methods using the Feynman data set”.

What is genetic programming for symbolic regression?

Genetic programming based symbolic regression evolves a population of candidate formulas through selection, crossover, and mutation. The resulting model is an explicit expression and therefore inherently interpretable, and in favorable cases the method can recover the true analytic solution rather than a numerical approximation.

What are the symbolic regression operators?

A symbolic regression model is composed of three different kinds of so-called operators: functions, which are basic mathematical functions such as cos and log; constants, which are simply floating-point values such as 3.14 and 2.17; and variables, which are features in a dataset such as x0 and x1.
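
As an illustration, the hypothetical formula cos(x0) + 3.14*x1 combines all three kinds of operators and can be represented as a small expression tree:

# A hypothetical symbolic model, cos(x0) + 3.14*x1, as a nested tree of
# operators: functions ("cos", "+", "*"), a constant (3.14), and variables ("x0", "x1").
import math

expr = ("+", ("cos", "x0"), ("*", 3.14, "x1"))

def evaluate(node, variables):
    if isinstance(node, str):           # variable, e.g. "x0"
        return variables[node]
    if isinstance(node, (int, float)):  # constant
        return node
    op, *args = node
    values = [evaluate(a, variables) for a in args]
    if op == "cos":
        return math.cos(values[0])
    if op == "+":
        return values[0] + values[1]
    if op == "*":
        return values[0] * values[1]
    raise ValueError(f"unknown operator: {op}")

print(evaluate(expr, {"x0": 0.0, "x1": 2.0}))  # cos(0) + 3.14 * 2 = 7.28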

What is a symbolic model in AI?

Symbolic AI algorithms work by processing symbols, which represent objects or concepts in the world, and their relationships. The main approach in Symbolic AI is to use logic-based programming, where rules and axioms are used to make inferences and deductions.

What are the symbols in a regression equation?

In the simple linear regression equation Y = a + bX, the symbol Y represents the dependent variable and X the independent variable. The symbol a represents the Y intercept, that is, the value that Y takes when X is zero. The symbol b describes the slope of the line: the number of units that Y changes when X changes by 1 unit.

What is an example of symbolic regression?

Symbolic regression is typically used for tasks where a set of observed variables is available and a prediction should be made from them. For example, given a large set of income data for individuals, symbolic regression could produce an explicit formula that predicts an optimal mortgage loan amount.

Is symbolic regression available in scikit-learn?

Symbolic regression is a machine learning technique that aims to identify an underlying mathematical expression that best describes a relationship. It is not part of scikit-learn itself, but third-party packages such as gplearn implement it behind a scikit-learn-compatible estimator interface, motivated by the scikit-learn ethos of having powerful estimators that are straightforward to use.
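
A minimal usage sketch with gplearn is shown below. gplearn is a third-party package, not part of scikit-learn; the parameter names follow its documented API, but defaults and exact settings should be checked against the installed version, and the data here is made up.

# Sketch: symbolic regression with the third-party gplearn package.
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])              # hidden ground-truth formula

est = SymbolicRegressor(
    population_size=1000,
    generations=20,
    function_set=("add", "sub", "mul", "sin"),  # base functions to search over
    parsimony_coefficient=0.01,                 # penalizes long formulas
    random_state=0,
)
est.fit(X, y)
print(est._program)                             # best expression found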

Which regression Python library is best?

scikit-learn, or sklearn for short, is the basic toolbox for anyone doing machine learning in Python. It is a Python library that contains many machine learning tools, from linear regression to random forests — and much more.
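
For completeness, a minimal scikit-learn regression example (with made-up data) looks like this:

# Minimal scikit-learn linear regression on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 4.0, 6.2, 7.9])      # roughly y = 2x

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)    # fitted slope and intercept
print(model.predict([[5.0]]))           # prediction for a new input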

What is symbolic regression for scientific discovery?

Symbolic regression for scientific discovery (SRSD) is the application of symbolic regression (SR), i.e., producing a mathematical expression in a human-understandable manner that fits a given dataset, to the problem of uncovering scientific relationships from experimental data.

What are the two types of regression methods?

The two basic types of regression are simple linear regression and multiple linear regression, although there are nonlinear regression methods for more complicated data and analysis.

What is symbolic regression in neuroscience?

Symbolic regression represents a key method to learn interpretable models in a purely data-driven manner. Recent developments in the symbolic regression field have shown that the use of deep neural networks boosts the performance of these methods.
