TITLE: On Optimal Data Split for Generalization Estimation and Model
Selection
AUTHOR: Jan Larsen and Cyril Goutte
Department of Mathematical Modelling, Building 321
Technical University of Denmark, DK-2800 Lyngby, Denmark
emails: jl,cg@imm.dtu.dk
www: http://eivind.imm.dtu.dk
ABSTRACT:
Modeling with flexible models, such as neural networks, requires
careful control of the model complexity and generalization ability of
the resulting model. Whereas general asymptotic estimators of
generalization ability have been developed over recent years,
it is widely acknowledged that in most modeling scenarios there is
insufficient data available to use these estimators reliably for assessing
generalization or for selecting and optimizing models. As a consequence, one
resorts to resampling techniques such as cross-validation,
jackknife or bootstrap. In this paper, we
address a crucial problem of cross-validation estimators:
how to split the data into various sets.
We study the very different behavior of
the two data splits using
hold-out cross-validation, K-fold cross-validation and
randomized permutation cross-validation.
The theoretical basics of various cross-validation techniques, with the
purpose of reliably estimating the generalization error and optimizing
the model structure, are described. Theoretical analysis and numerical
experiments clarify the very different behavior of the data splits.
Submitted to IEEE Neural Networks for Signal Processing, 1999.