Learning with Capacity Control: A Semi-Supervised Approach




A fundamental problem in machine learning is overfitting to empirical data. Overfitting occurs when a learning process constructs an overly strong model to explain the dependencies within the training sample. Such models often fail to perform well on unseen data because they are strongly biased towards the particular sample used in the learning task.


Learning algorithms use different remedies to prevent overfitting. For example, regularization and early stopping help neural networks yield models that generalize better to unseen data, pruning is used in decision trees, and margin maximization techniques are used in Support Vector Machines. These solutions are all examples of capacity control techniques: by limiting the richness and flexibility of the learning method, we expect to prevent overfitting.
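As a concrete illustration of capacity control (a generic example, not specific to this work), the sketch below contrasts ordinary least squares with L2-regularized (ridge) regression on a small polynomial fit; the penalty weight `lam`, the polynomial degree, and the sample size are arbitrary choices for the illustration. The penalty shrinks the coefficient vector, limiting the model's effective capacity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small noisy sample: a degree-9 polynomial has nearly as many
# parameters as data points, so an unpenalized fit memorizes noise.
x = np.linspace(0.0, 1.0, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Design matrix of polynomial features 1, x, x^2, ..., x^9.
X = np.vander(x, 10, increasing=True)

def fit(X, y, lam):
    """Ridge solution (X'X + lam*I)^{-1} X'y; lam = 0 gives least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_free = fit(X, y, lam=0.0)   # unconstrained: large, noise-driven coefficients
w_reg = fit(X, y, lam=1e-3)   # penalized: capacity is limited

norm_free = np.linalg.norm(w_free)
norm_reg = np.linalg.norm(w_reg)
```

Here `norm_reg` comes out far smaller than `norm_free`: the penalty trades a slightly worse fit on the training sample for a smoother, lower-capacity model.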


In this research, we introduce learning methods with new ways of handling capacity control. These methods combine supervised, unsupervised, and semi-supervised learning so that all of the available information is incorporated; semi-supervised learning in particular can exploit both labeled and unlabeled data. We first propose a genetic-algorithm-based model that uses unsupervised learning on unlabeled data, i.e. semi-supervised learning, as a new form of capacity control. We then use unlabeled data to influence margin maximization in support vector machines, an alternative form of capacity control based on all available information. Finally, we apply capacity control in the label space to a boosting approach that combines the outputs of many weak learning models by linear weighting. In short, we propose semi-supervised learning models based on both labeled and unlabeled data as new forms of capacity control. The methods proposed in this research show strong results on benchmark problems compared to alternative approaches.
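To make the semi-supervised idea concrete, here is a minimal self-training sketch (an illustration of exploiting unlabeled data in general, not the genetic-algorithm, SVM, or boosting methods proposed in this work; the two-Gaussian data and nearest-centroid classifier are assumptions of the example). A classifier fit on a few labeled points assigns pseudo-labels to the unlabeled data and is then refit on both.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes; only 2 points per class carry labels.
n = 200
X0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(n, 2))
X1 = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(n, 2))
X_lab = np.vstack([X0[:2], X1[:2]])
y_lab = np.array([0, 0, 1, 1])
X_unl = np.vstack([X0[2:], X1[2:]])
y_unl_true = np.array([0] * (n - 2) + [1] * (n - 2))  # held out, scoring only

def centroids(X, y):
    """Per-class mean vectors for classes 0 and 1."""
    return np.vstack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(C, X):
    """Assign each point to its nearest class centroid."""
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    return d.argmin(axis=1)

# Round 1: centroids estimated from the 4 labeled points only.
C1 = centroids(X_lab, y_lab)
pseudo = predict(C1, X_unl)

# Round 2 (self-training): refit on labeled + pseudo-labeled data.
X_all = np.vstack([X_lab, X_unl])
y_all = np.concatenate([y_lab, pseudo])
C2 = centroids(X_all, y_all)

acc1 = (predict(C1, X_unl) == y_unl_true).mean()
acc2 = (predict(C2, X_unl) == y_unl_true).mean()
```

The refit centroids are averaged over hundreds of points instead of four, so the unlabeled data sharpens the decision boundary even though none of it was ever labeled by hand.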