 |
Groupe de
Recherche sur
l'Apprentissage
Automatique
(Research Group on Machine Learning) |
setting-up | research themes
| publications | tree automata
Théorie
Algorithmique de l'Information - TAI 2000
The Setting-up of the team is the following
:
The research team is mainly composed of faculty members in computer
science of the Université Charles de Gaulle -
Lille 3 . At the moment, this team
is part of the Laboratory of Fundamental Computer Science of Lille or
LIFL (Laboratoire d'Informatique Fondamentale
de Lille - URA 369 CNRS) situated in the Université des Sciences et
Technologies de Lille - Lille 1.
Persons in charge :
François Denis, assistant
professor in computer science, Lille 3 ;
Rémi Gilleron, professor
in computer science, Lille 3 ;
Members :
Francesco De Comité, assistant
professor in computer science, Lille 1 ;
Isabelle
Tellier, assistant professor in computer science, Lille 3 ;
Alain
Terlutte, assistant professor in computer science, Lille 3 ;
Marc Tommasi, assistant professor
in computer science, Lille3.
PhD students :
Cyrille D'Halluin, PhD attended in july, 1998 ;
Jean Simon, PhD attended in december 1999 ;
Fabien Letouzey, PhD started in september 1998.
Aurélien Lemay, PhD started in september 1999.
Our main interest is Machine Learning, i.e. the study and design of software
improving through experience. This theme is developped according to three
directions of research : formal models of learning ; natural language
learning ; learning from unlabeled examples and
applications.
Formal models of learning :
In Machine Learning, heuristics are ususally justified in terms of algorithmic
complexity (principle of Occam Razor, Minimal Description Length, minimization
of the empirical risk). On the other hand, in natural learning, teachers seem
to give a higher priority to simple or highly informative examples. The
Kolmogorov complexity is a nearly perfect measure of the algorithmic complexity
of an object and the Solomonoff-Levin distribution gives greater importance
to objects simple relatively to this measure, (i.e. associated with a low
Kolmogorov complexity). Our purpose is to provide a safe basis for heuristics
used in natural and machine learning by making use of this complexity notion,
and to define formal models of learning in which they are theoretically
justified. Our main results are the following :
-
Definition of formal models : We have defined two extensions of the
Probably Approximately Correct (PAC) model of Valiant. The first one is the
learning model from simple examples, in which the examples are provided according
to the Solomonoff-Levin distribution relatively to the target (Denis et al
96). The second one is the learning with a helpful teacher, in which the
only admitted distributions of examples are the ones assigning a non
null probability to the members of a characteristic sample set associated
with the target (Denis et Gilleron 97), (Denis et al 97).
-
Learning from positive examples : In natural situations, it is often
noticed that the learner has to content himself with positive data. In this
context, it is hard to explain how human learners avoid over-generalization.
We have defined an extension of the Valiant's PAC model dedicated to the
learning from positive examples. In this model, the learner receives both
positive and unlabeled examples from an oracle, and uses the unlabeled data
to infer information about the underlying distribution. This model can also
be extended to learning by queries (Denis 98) and is applied in the domain
of learning by decision trees described further.
-
Learning Regular languages : Our study concerning learning models
gave rise to theoretical results about the learning of regular languages.
To empirically validate these results, a platform dedicated to the comparative
study of various algorithm performances has been programmed and tested (PhD
dissertation of Cyrille D'Halluin 98).
-
Formal theories and natural learning : A joined analysis of certain
aspects of learning models in psychology and computer science is being performed.
This work is the subject of the PhD dissertation of Jean Simon.
Natural language learning
Natural language learning is a universal human process, based on positive
data only (i.e. syntactically correct sentences). But, from a theoretical
point of view, natural languages are at least context-free and this class
is not learnable from positive data in usual learning models. To overcome
this paradox, various solutions have been proposed : for example, following
the chomskian intuition, it can be admitted that natural languages belong
to a restricted family whose structural properties are innately known by
the human mind, or that the examples are provided by a helpful teacher making
the learning possible. Our idea is to explore another possible explanation,
taking into account semantic information attached to the examples.
We have developped a syntactico-semantic model of language learning based
on the Principle of Compositionality, which precisely describes the link
between syntax and semantics. This model has been examplified in the context
of various kinds of Categorial Grammars, associated with logical sematics.
The main results obtained are :
-
Syntactic analysis with categorial grammars : we have proposed a new
algorithm for the syntactic analysis with categorial grammars, and study
its complexity (Finkel et Tellier 96).
-
Syntactico-semantic learning : We have defined a new learning model
in which both syntax and semantic information are used. This model can be
adapted to special cases of categorial grammars (Tellier 98a, 98b, 98c) or
to any linguistic formalism satisfying the Principle of Compositionality
(Tellier 99). This model gave rise to a software implementation to be still
improved, taking into account the heuristics inspired by the formal models
of learning from simple examples and from positive examples defined in previous
works (see below).
Decision trees and unlabeled examples :
In many learning problems, labeled examples are rare or expensive
while numerous unlabeled and positive examples are available. Thus we
address the problem of learning with the help of positive and
unlabeled data given a small number of labeled examples. We find both
theoretical and empirical arguments showing that learning algorithms
can be improved by the use of both unlabeled and positive data. We
give theoretical results for the improvement of Statistical Query
learning algorithms from positive and unlabeled data. We apply these
ideas to tree induction algorithms. We modify the code of c4.5 to get
an algorithm which takes as input a set LAB of labeled examples, a set
POS of positive examples and a set UNL of unlabeled data and which
uses these three sets to construct the decision tree (Decomité et al 99).
List of publications and reports :
International Journal :
-
F. Denis, Finding a minimal 1-DNF consistent with a positive sample is
LogSNP-complete, Information Processing Letters, 69, 1-5,
(1999).
-
F. Denis, Learning regular languages from simple positive
examples, to appear in Machine Learning, rapport technique du LIFL IT 321, version
ps
,
(1998).
-
F. Denis and R. Gilleron, PAC Learning under Helpful
Distributions, submitted (1997).
-
I. Tellier, Learning to Understand, submitted, (1999).
-
A. Finkel and I. Tellier, A polynomial algorithm for the membership problem
with categorial grammars, Theoretical Computer Science 164:1-2, 207--221,
(1996).
International Conferences
- F. De Comité, F. Denis, R. Gilleron et F. Letouzey, Positive and
Unlabeled Examples help Learning, ALT 99, LNAI 1720, 219-230,
version
ps
-
F. Denis, PAC Learning from Positive Statistical Queries, ALT 98, 9th
International Conference on Algorithmic Learning Theory, LNAI 1501, 112-126,
(1998). version
ps
-
I. Tellier, Syntactico-Semantic Learning of Categorical Grammars,
Proceedings of NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding
in Language Learning, 311--314, (1998). version
ps
-
I. Tellier, Meaning helps learning Syntax, Proceedings of ICGI'98 Workshop
on Grammatical Inference, LNCS 1433, 25-36, (1998).
-
F. Denis and R. Gilleron, PAC Learning under Helpful
Distributions, ALT 97, 8th International Workshop on Algorithmic
Learning Theory, LNAI 1316, 132-145, (1997). version
ps
-
F. Denis, C. D'Halluin and R. Gilleron, PAC Learning with Simple Examples,
STACS 96, 13th Annual Symposium on Theoretical Aspects of Computer Science,
LNCS 1046, 231--242,
1996.version
ps
Rémi Gilleron and Marc Tommasi are also active researchers in the
domain of rewriting techniques and software analysis. They work on set constraint
resolution and on the theory of rewriting terms. They also took part in the
writing of a collective book on tree automata and their applications. Main
references on the subject are :
-
H. Comon, M. Dauchet, R. Gilleron, D. Lugiez, S. Tison et
M. Tommasi, Tree automata techniques and applications, page TATA
-
R. Gilleron, S. Tison et M. Tommasi, Set constraints and Automata,
Information and Computation, 149, 1-41 (1999).
-
F. Seynhaeve, S. Tison, M. Tommasi et R. Treinen, Grid structures
and undecidable constraint theories, TAPSOFT 97, LNCS 1214,
357-368, (1997).
-
R. Gilleron et S. Tison, Regular tree languages and rewrite
systems, Fundamenta Informaticae, 24, 157-176, (1995).
