regression_analysis_101
Estimating the parameters of a model by optimising the fit of the model to empirical data is a staple of computational data science.
This repository contains a set of Jupyter notebooks that demonstrate how to implement a range of regression methods and a Bayesian MCMC approach for fitting models and estimating model parameters. These are demonstrated for a variety of geoscience type applications, e.g. fitting an isochron to isotope data to estimate a geologcal age, fitting observed Bouguer gravity anomaly data from the Isle of Mull in Scotland using simple gravity model for a buried spherical object, and estimating parameters for a simple continental geotherm model based on PT estimates derived from mantle xenolith geochemistry.
Various other routine coding tasks (e.g. reading and writing data files) and plotting techniques (using Matplotlib
routines) are also illustrated.
A minimal example-fitting a straight line to set of x and y values
A minimal example is fitting a simple linear model, i.e. a straight line, , to some set of measurements of y (commonly called the dependent variable or response variable) for a given range of x values (commonly called the predicter variable or independent variable), to estimate the slope, m, and intercept c, of the line.
A routine method for achieving this is Ordinary Least Squares regression (OLS). OLS aims to minimise the difference between the observed values of y and the values predicted by the model. It achieves this by finding the model that has the smallest value of the sum of the squared differences, these differences are usually called the residuals. These are illustrated below by the vertical red lines between the green observed values and the blue line representing the model predictions. The routine in the scipy package that implements this is scipy.stats.linregress
.
Taking care of the errors-how to handle uncertainty in measured data
OLS is quick and straightforward, but does not take account of the magnitude of measurement uncertainties on x or y values. Orthogonal Distance Regression (ODR) enables measurement errors in both variables to be accounted for, and is particularly appropropriate if the measured data includes outliers with large errors.
The Bayesian approach using a Markov Chain Monte Carlo sampling strategy
An alternative to using regression techniques to estimating model parameters is to use a Bayesian approach and a Markov Chain Monte Carlo sampling strategy.
An advantage of the Bayesian MCMC technique is that you can treat both the model parameters and the observed data as uncertain, and estimate their likely values (credible intervals) and the posterior probability distributions. This can be implemented in python using the pyMC3
package. See the pyMC3 docs page for details.