Publication
A Bayes-True Data Generator for Evaluation of Supervised and Unsupervised Learning Methods
Janick Frasch; Aleksander Lodwich; Faisal Shafait; Thomas Breuel
In: Pattern Recognition Letters (PRL), Vol. 32, No. 11, Pages 1523-1531, Elsevier, 8/2011.
Abstract
Benchmarking pattern recognition, machine learning and data mining methods commonly relies on real-world data sets.
However, using real-world data has disadvantages. On the one hand, collecting real-world data can be
difficult or impossible for various reasons; on the other hand, real-world variables are hard to control, even in the problem
domain. In the feature domain, where most statistical learning methods operate, exercising control is even more difficult
and hence rarely attempted. This is at odds with guidelines for scientific experimentation, which mandate the use of variables that are
as directly controllable and as directly observable as possible. Because of this, synthetic data possesses certain
advantages over real-world data sets. In this paper we propose a method that produces synthetic data with guaranteed
global and class-specific statistical properties. This method is based on overlapping class densities placed on the corners
of a regular k-simplex. This generator can be used for algorithm testing and fair performance evaluation of statistical
learning methods. Because of the strong properties of this generator, researchers can reproduce each other's experiments
by knowing the parameters used, instead of transmitting large data sets.
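
The idea of placing overlapping class densities on the corners of a regular k-simplex can be illustrated with a minimal sketch. It is not the authors' generator: the abstract does not specify the densities or their parameters, so the isotropic Gaussian classes, equal class priors, and the names `simplex_vertices`, `sample_classes`, `bayes_labels`, `edge`, and `sigma` are all assumptions made here for illustration. Under those assumptions the Bayes-optimal decision is simply the nearest vertex, and a fixed seed plus the parameters is enough to regenerate the same data.

```python
import numpy as np

def simplex_vertices(k, edge=1.0):
    """Return the k+1 vertices of a regular k-simplex, embedded in R^(k+1)."""
    # The standard basis vectors are pairwise sqrt(2) apart, so rescaling
    # them yields a regular simplex with the requested edge length.
    return np.eye(k + 1) * (edge / np.sqrt(2.0))

def sample_classes(k, n_per_class, edge=1.0, sigma=0.5, seed=0):
    """Draw n_per_class points per class; class i is centred on vertex i.

    Assumption (not from the paper): each class density is an isotropic
    Gaussian with standard deviation sigma, and all classes are equally likely.
    """
    rng = np.random.default_rng(seed)
    verts = simplex_vertices(k, edge)
    X = np.concatenate(
        [v + sigma * rng.standard_normal((n_per_class, k + 1)) for v in verts]
    )
    y = np.repeat(np.arange(k + 1), n_per_class)
    return X, y

def bayes_labels(X, verts):
    """Nearest-vertex rule, which is Bayes-optimal under the assumptions above."""
    dists = np.linalg.norm(X[:, None, :] - verts[None, :, :], axis=2)
    return dists.argmin(axis=1)

if __name__ == "__main__":
    k, edge, sigma = 3, 2.0, 0.8
    X, y = sample_classes(k, n_per_class=1000, edge=edge, sigma=sigma, seed=42)
    pred = bayes_labels(X, simplex_vertices(k, edge))
    # Empirical error of the optimal rule; it grows as sigma/edge increases,
    # i.e. as the class densities overlap more.
    print("empirical error of nearest-vertex rule:", np.mean(pred != y))
```

Because every vertex of a regular simplex is equidistant from every other, the ratio `sigma / edge` controls the overlap between all class pairs at once, which is what makes the difficulty of the generated problem easy to tune in this sketch.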