For the machine learning concept, see Overfitting (machine learning).
Noisy (roughly linear) data is fit to both linear and polynomial functions. Although the polynomial function passes through each data point, and the line passes through few, the line is a better fit because it does not have the large excursions at the ends. If the regression curves were used to extrapolate the data, the overfit would do much worse.

In statistics, overfitting is the fitting of a statistical model that has too many parameters. An absurd and false model may fit perfectly if it is complex enough relative to the amount of data available. When the degrees of freedom in parameter selection exceed the information content of the data, the final (fitted) model parameters become arbitrary, which reduces or destroys the ability of the model to generalize beyond the fitting data. The likelihood of overfitting depends not only on the number of parameters and the amount of data but also on the conformability of the model structure with the data shape, and on the magnitude of model error compared to the expected level of noise or error in the data.
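The effect can be reproduced with a minimal sketch in Python using NumPy. The data set here is hypothetical (a line y = 2x + 1 with added Gaussian noise, ten points), and the choice of a degree-9 polynomial, the noise level, and the helper name `rmse` are assumptions made for illustration only. The ten-coefficient polynomial has as many parameters as there are data points, so it reproduces the fitting data almost exactly, yet it should extrapolate far worse than the two-parameter line.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a true line y = 2x + 1 plus Gaussian noise.
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)

# Two-parameter model (straight line) versus a ten-parameter model
# (degree-9 polynomial, one coefficient per data point).
line = np.polynomial.Polynomial.fit(x, y, deg=1)
poly = np.polynomial.Polynomial.fit(x, y, deg=9)


def rmse(model, xs, ys):
    """Root-mean-square error of a fitted model on the points (xs, ys)."""
    return float(np.sqrt(np.mean((model(xs) - ys) ** 2)))


# Error on the data used for fitting.
fit_line = rmse(line, x, y)
fit_poly = rmse(poly, x, y)

# Error when extrapolating just beyond the fitted range,
# measured against the noise-free underlying line.
x_new = np.linspace(1.05, 1.3, 6)
y_new = 2.0 * x_new + 1.0
extrap_line = rmse(line, x_new, y_new)
extrap_poly = rmse(poly, x_new, y_new)

print(f"fitting data : line {fit_line:.3g}, polynomial {fit_poly:.3g}")
print(f"extrapolation: line {extrap_line:.3g}, polynomial {extrap_poly:.3g}")
```

The near-zero fitting error of the polynomial reflects its having enough freedom to absorb the noise, not a better description of the underlying relationship.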

Even when the fitted model does not have an unusually large number of degrees of freedom, the fitted relationship can be expected to perform less well on a new data set than on the data set used for fitting.[1] In particular, the coefficient of determination will tend to be smaller on new data than on the original fitting data.
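This shrinkage can be illustrated with a small simulation, sketched here in Python with NumPy. The setup is an assumption chosen for illustration: a simple linear regression with only two parameters, fitted repeatedly to hypothetical samples of 20 points drawn from y = x + noise, and then scored on a fresh sample from the same process. The helper names `r_squared` and `sample` are not from any library. Averaged over many repetitions, the coefficient of determination on new data should come out lower than on the fitting data.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat for observations y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def sample(rng, n):
    """Draw n points from the hypothetical process y = x + unit Gaussian noise."""
    x = rng.normal(size=n)
    y = x + rng.normal(size=n)
    return x, y


rng = np.random.default_rng(1)
n, trials = 20, 2000
r2_fit, r2_new = [], []

for _ in range(trials):
    x_fit, y_fit = sample(rng, n)
    x_new, y_new = sample(rng, n)
    # Ordinary least-squares line; polyfit returns the highest degree first.
    slope, intercept = np.polyfit(x_fit, y_fit, deg=1)
    r2_fit.append(r_squared(y_fit, slope * x_fit + intercept))
    r2_new.append(r_squared(y_new, slope * x_new + intercept))

mean_r2_fit = float(np.mean(r2_fit))
mean_r2_new = float(np.mean(r2_new))
print(f"mean R^2 on fitting data: {mean_r2_fit:.3f}")
print(f"mean R^2 on new data    : {mean_r2_new:.3f}")
```

Any single repetition may go either way; the comparison is meaningful only on average.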

In both statistics and machine learning, avoiding overfitting requires additional techniques (e.g. cross-validation, regularization, early stopping, Bayesian priors on parameters, or model comparison) that can indicate when further training is not resulting in better generalization.
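As one example of these techniques, the following Python sketch uses k-fold cross-validation to compare polynomial models of different degree. Everything specific in it is an assumption made for illustration: the hypothetical noisy linear data, the choice of five folds, the candidate degrees 1 to 10, and the helper name `kfold_mse`. Each model is fitted on four fifths of the data and scored on the remaining fifth, so the score reflects performance on points the fit has not seen.

```python
import numpy as np


def kfold_mse(x, y, degree, folds):
    """Mean squared prediction error of a polynomial fit, averaged over folds."""
    errors = []
    for held_out in folds:
        # Fit on every point that is not in the held-out fold.
        train = np.setdiff1d(np.arange(x.size), held_out)
        model = np.polynomial.Polynomial.fit(x[train], y[train], deg=degree)
        errors.append(np.mean((model(x[held_out]) - y[held_out]) ** 2))
    return float(np.mean(errors))


rng = np.random.default_rng(2)

# Hypothetical data: a true line y = 2x + 1 plus Gaussian noise.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

# Use the same random five-fold split for every candidate degree.
folds = np.array_split(rng.permutation(x.size), 5)

cv_mse = {d: kfold_mse(x, y, d, folds) for d in range(1, 11)}
best_degree = min(cv_mse, key=cv_mse.get)

for d, err in cv_mse.items():
    print(f"degree {d:2d}: cross-validated MSE {err:.3g}")
print("degree with lowest cross-validated error:", best_degree)
```

Unlike the error on the fitting data, which can only decrease as the degree grows, the cross-validated error typically rises again once the extra parameters start fitting noise.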

References

  1. Everitt, B.S. (2002). Cambridge Dictionary of Statistics. CUP.
