This page lists the datasets I used: the original data from the UCI Machine Learning Repository, together with my own cross-validation files.
See also:
- our Information Extraction Corpus, used in Patrick Marty's PhD thesis;
- UCI Machine Learning Repository;
- IDA Benchmark Repository;
- Grammatical Inference Benchmarks Repository.
Mushroom Database (agaricus-lepiota)
Description
- original description : [ agaricus-lepiota.txt ]
- 2 classes, 8124 instances, 22 attributes (all nominal)
- 1.4 % missing values
- best observed accuracy: 100.0 % (majority class: 51.8 %)
Downloads
- complete dataset in C4.5 format : [ agaricus-lepiota.names ][ agaricus-lepiota.data ]
- cross-validation files : [ agaricus-lepiota.tar.gz ]
Standardized Audiology Database (audiology)
Description
- original description : [ audiology.txt ]
- 24 classes, 226 instances, 70 attributes (all nominal)
- 2.0 % missing values
- best observed accuracy: 88.0 % (majority class: 25.2 %)
Downloads
- complete dataset in C4.5 format : [ audiology.names ][ audiology.data ]
- cross-validation files : [ audiology.tar.gz ]
ML94/COLT94 Badge Problem (badges)
Description
- original description : [ badges.txt ]
- 2 classes, 294 instances, 9 attributes (all nominal)
- no missing values
- best observed accuracy: 98.7 % (majority class: 71.4 %)
Downloads
- complete dataset in C4.5 format : [ badges.names ][ badges.data ]
- cross-validation files : [ badges.tar.gz ]
Blood Transfusion Service Center (blood-transfusion)
Description
- original description : [ blood-transfusion.txt ]
- 2 classes, 748 instances, 4 attributes (all numeric)
- no missing values
- percent of instances in the majority class: 76.2 %
Downloads
- complete dataset in C4.5 format : [ blood-transfusion.names ][ blood-transfusion.data ]
Wisconsin Breast Cancer (breast-cancer)
Description
- original description : [ breast-cancer.txt ]
- 2 classes, 699 instances, 10 attributes (all numeric)
- 0.2 % missing values
- best observed accuracy: 97.0 % (majority class: 65.5 %)
Downloads
- complete dataset in C4.5 format : [ breast-cancer.names ][ breast-cancer.data ]
- cross-validation files : [ breast-cancer.tar.gz ]
Car Evaluation Database (car)
Description
- original description : [ car.txt ]
- 4 classes, 1728 instances, 6 attributes (all nominal)
- no missing values
- best observed accuracy: 99.5 % (majority class: 70.0 %)
Downloads
- complete dataset in C4.5 format : [ car.names ][ car.data ]
- cross-validation files : [ car.tar.gz ]
Contraceptive Method Choice (cmc)
Description
- original description : [ cmc.txt ]
- 3 classes, 1473 instances, 9 attributes (2 numeric and 7 nominal)
- no missing values
- best observed accuracy: 55.2 % (majority class: 42.7 %)
Downloads
- complete dataset in C4.5 format : [ cmc.names ][ cmc.data ]
- cross-validation files : [ cmc.tar.gz ]
Credit Approval (crx)
Description
- original description : [ crx.txt ]
- 2 classes, 690 instances, 15 attributes (6 numeric and 9 nominal)
- 0.6 % missing values
- best observed accuracy: 86.6 % (majority class: 55.5 %)
Downloads
- complete dataset in C4.5 format : [ crx.names ][ crx.data ]
- cross-validation files : [ crx.tar.gz ]
Dermatology Database (dermatology)
Description
- original description : [ dermatology.txt ]
- 6 classes, 366 instances, 34 attributes (1 numeric and 33 nominal)
- 0.1 % missing values
- best observed accuracy: 96.9 % (majority class: 30.6 %)
Downloads
- complete dataset in C4.5 format : [ dermatology.names ][ dermatology.data ]
- cross-validation files : [ dermatology.tar.gz ]
Protein Localization Sites (ecoli)
Description
- original description : [ ecoli.txt ]
- 8 classes, 336 instances, 8 attributes (all numeric)
- no missing values
- best observed accuracy: 85.4 % (majority class: 42.6 %)
Downloads
- complete dataset in C4.5 format : [ ecoli.names ][ ecoli.data ]
- cross-validation files : [ ecoli.tar.gz ]
Glass Identification (glass)
Description
- original description : [ glass.txt ]
- 6 classes, 214 instances, 10 attributes (all numeric)
- no missing values
- best observed accuracy: 95.5 % (majority class: 35.5 %)
Downloads
- complete dataset in C4.5 format : [ glass.names ][ glass.data ]
- cross-validation files : [ glass.tar.gz ]
Hepatitis Domain (hepatitis)
Description
- original description : [ hepatitis.txt ]
- 2 classes, 155 instances, 19 attributes (6 numeric and 13 nominal)
- 5.7 % missing values
- best observed accuracy: 85.2 % (majority class: 79.4 %)
Downloads
- complete dataset in C4.5 format : [ hepatitis.names ][ hepatitis.data ]
- cross-validation files : [ hepatitis.tar.gz ]
Horse Colic Database (horse-colic)
Description
- original description : [ horse-colic.txt ]
- 2 classes, 368 instances, 23 attributes (7 numeric and 16 nominal)
- 22.8 % missing values
- best observed accuracy: 86.4 % (majority class: 63.0 %)
Downloads
- complete dataset in C4.5 format : [ horse-colic.names ][ horse-colic.data ]
- cross-validation files : [ horse-colic.tar.gz ]
1984 United States Congressional Voting Records Database (house-votes-84)
Description
- original description : [ house-votes-84.txt ]
- 2 classes, 435 instances, 16 attributes (all nominal)
- 5.6 % missing values
- best observed accuracy: 96.8 % (majority class: 61.4 %)
Downloads
- complete dataset in C4.5 format : [ house-votes-84.names ][ house-votes-84.data ]
- cross-validation files : [ house-votes-84.tar.gz ]
Ionosphere (ionosphere)
Description
- original description : [ ionosphere.txt ]
- 2 classes, 351 instances, 34 attributes (all numeric)
- no missing values
- best observed accuracy: 93.8 % (majority class: 64.1 %)
Downloads
- complete dataset in C4.5 format : [ ionosphere.names ][ ionosphere.data ]
- cross-validation files : [ ionosphere.tar.gz ]
Iris Plant (iris)
Description
- original description : [ iris.txt ]
- 3 classes, 150 instances, 4 attributes (all numeric)
- no missing values
- best observed accuracy: 96.7 % (majority class: 33.3 %)
Downloads
- complete dataset in C4.5 format : [ iris.names ][ iris.data ]
- cross-validation files : [ iris.tar.gz ]
MAGIC gamma telescope data 2004 (magic04)
Description
- original description : [ magic04.txt ]
- 2 classes, 19020 instances, 10 attributes (all numeric)
- no missing values
- percent of instances in the majority class: 64.8 %
Downloads
- complete dataset in C4.5 format : [ magic04.names ][ magic04.data ]
Ozone Level Detection (ozone)
Description
- original description : [ ozone.txt ]
- 2 classes, 2536 instances, 73 attributes (all numeric)
- 8.1 % missing values
- percent of instances in the majority class: 97.1 %
Downloads
- complete dataset in C4.5 format : [ ozone.names ][ ozone.data ]
Parkinsons Data Set (parkinsons)
Description
- original description : [ parkinsons.txt ]
- 2 classes, 195 instances, 23 attributes (all numeric)
- no missing values
- percent of instances in the majority class: 75.4 %
Downloads
- complete dataset in C4.5 format : [ parkinsons.names ][ parkinsons.data ]
Pima Indians Diabetes (pima)
Description
- original description : [ pima.txt ]
- 2 classes, 768 instances, 8 attributes (all numeric)
- no missing values
- best observed accuracy: 75.4 % (majority class: 65.1 %)
Downloads
- complete dataset in C4.5 format : [ pima.names ][ pima.data ]
- cross-validation files : [ pima.tar.gz ]
Promoter Gene Sequences Database (promoters)
Description
- original description : [ promoters.txt ]
- 2 classes, 106 instances, 57 attributes (all nominal)
- no missing values
- best observed accuracy: 96.2 % (majority class: 50.0 %)
Downloads
- complete dataset in C4.5 format : [ promoters.names ][ promoters.data ]
- cross-validation files : [ promoters.tar.gz ]
Sonar: Mines vs. Rocks (sonar)
Description
- original description : [ sonar.txt ]
- 2 classes, 208 instances, 60 attributes (all numeric)
- no missing values
- best observed accuracy: 85.5 % (majority class: 53.4 %)
Downloads
- complete dataset in C4.5 format : [ sonar.names ][ sonar.data ]
- cross-validation files : [ sonar.tar.gz ]
Spambase Data Set (spambase)
Description
- original description : [ spambase.txt ]
- 2 classes, 4601 instances, 57 attributes (all numeric)
- no missing values
- percent of instances in the majority class: 60.6 %
Downloads
- complete dataset in C4.5 format : [ spambase.names ][ spambase.data ]
Tic-Tac-Toe Endgame (tic-tac-toe)
Description
- original description : [ tic-tac-toe.txt ]
- 2 classes, 958 instances, 9 attributes (all nominal)
- no missing values
- best observed accuracy: 100.0 % (majority class: 65.3 %)
Downloads
- complete dataset in C4.5 format : [ tic-tac-toe.names ][ tic-tac-toe.data ]
- cross-validation files : [ tic-tac-toe.tar.gz ]
Vowel Recognition (vowel)
Description
- original description : [ vowel.txt ]
- 11 classes, 990 instances, 10 attributes (all numeric)
- no missing values
- best observed accuracy: 93.7 % (majority class: 9.1 %)
Downloads
- complete dataset in C4.5 format : [ vowel.names ][ vowel.data ]
- cross-validation files : [ vowel.tar.gz ]
Wine Recognition (wine)
Description
- original description : [ wine.txt ]
- 3 classes, 178 instances, 13 attributes (all numeric)
- no missing values
- best observed accuracy: 97.7 % (majority class: 39.9 %)
Downloads
- complete dataset in C4.5 format : [ wine.names ][ wine.data ]
- cross-validation files : [ wine.tar.gz ]
Zoo Database (zoo)
Description
- original description : [ zoo.txt ]
- 7 classes, 101 instances, 17 attributes (all nominal)
- no missing values
- best observed accuracy: 97.3 % (majority class: 40.6 %)
Downloads
- complete dataset in C4.5 format : [ zoo.names ][ zoo.data ]
- cross-validation files : [ zoo.tar.gz ]
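The figures quoted in each Description block (majority class share, percentage of missing values) can be recomputed from the .data files. The following is a minimal sketch, not the script used to build this page. It assumes the usual C4.5 conventions: one instance per line, comma-separated values, the class label as the last field, "?" for a missing value, and "|" starting a comment. The function names and the inline sample are hypothetical, and counting missing values over attribute values only (excluding the class) is an assumption about how the percentages above were obtained.

```python
from collections import Counter


def read_c45_data(lines):
    """Parse C4.5-style .data lines into (attribute_values, label) pairs."""
    rows = []
    for raw in lines:
        # Drop comments, surrounding whitespace and the optional final period.
        line = raw.split("|", 1)[0].strip().rstrip(".")
        if not line:
            continue
        *values, label = [field.strip() for field in line.split(",")]
        rows.append((values, label))
    return rows


def majority_class_percent(rows):
    """Percentage of instances that belong to the most frequent class."""
    counts = Counter(label for _, label in rows)
    return 100.0 * max(counts.values()) / len(rows)


def missing_percent(rows):
    """Percentage of attribute values marked '?' (class labels excluded)."""
    values = [v for attrs, _ in rows for v in attrs]
    return 100.0 * sum(v == "?" for v in values) / len(values)


# Hypothetical sample, not taken from any dataset on this page.
sample = [
    "x,s,n,e",
    "b,?,w,e",
    "x,y,w,p",
    "| a comment line",
    "f,f,g,e",
]
rows = read_c45_data(sample)
print(majority_class_percent(rows))        # 75.0
print(round(missing_percent(rows), 1))     # 8.3
```

To run it on a real file, pass an open file object instead of the sample list, for example `read_c45_data(open("zoo.data"))`. Some of the original UCI files carry an identifier column or place the class elsewhere, so check the matching .names file before trusting the numbers.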




