Groupe de Recherche sur l'Apprentissage Automatique

(Research Group on Machine Learning)


setting-up | research themes | publications | tree automata

Théorie Algorithmique de l'Information - TAI 2000


The Setting-up  of the team is the following :

  The research team is mainly composed of faculty members in computer science of the Université Charles de Gaulle - Lille 3 . At the moment, this team is part of the Laboratory of Fundamental Computer Science of Lille or LIFL (Laboratoire d'Informatique Fondamentale de Lille - URA 369 CNRS) situated in the Université des Sciences et Technologies de Lille - Lille 1.
 

Persons in charge :

François Denis, assistant professor in computer science, Lille 3 ;

Rémi Gilleron, professor in computer science, Lille 3 ;

Members :

Francesco De Comité, assistant professor in computer science, Lille 1 ;

Isabelle Tellier, assistant professor in computer science, Lille 3 ;

Alain Terlutte, assistant professor in computer science, Lille 3 ;

Marc Tommasi, assistant professor in computer science, Lille3.

PhD students :

Cyrille D'Halluin, PhD attended in july, 1998 ;

Jean Simon, PhD attended in december 1999 ;

Fabien Letouzey, PhD started in september 1998.

Aurélien Lemay, PhD started in september 1999.
 


Research themes :

Our main interest is Machine Learning, i.e. the study and design of software improving through experience. This theme is developped according to three directions of research : formal models of learning ; natural language learning ; learning from unlabeled examples and applications.

Formal models of learning :

In Machine Learning, heuristics are ususally justified in terms of algorithmic complexity (principle of Occam Razor, Minimal Description Length, minimization of the empirical risk). On the other hand, in natural learning, teachers seem to give a higher priority to simple or highly informative examples. The Kolmogorov complexity is a nearly perfect measure of the algorithmic complexity of an object and the Solomonoff-Levin distribution gives greater importance to objects simple relatively to this measure, (i.e. associated with a low Kolmogorov complexity). Our purpose is to provide a safe basis for heuristics used in natural and machine learning by making use of this complexity notion, and to define formal models of learning in which they are theoretically justified. Our main results are the following :

Natural language learning

Natural language learning is a universal human process, based on positive data only (i.e. syntactically correct sentences). But, from a theoretical point of view, natural languages are at least context-free and this class is not learnable from positive data in usual  learning models. To overcome this paradox, various solutions have been proposed : for example, following the chomskian intuition, it can be admitted that natural languages belong to a restricted family whose structural properties are innately known by the human mind, or that the examples are provided by a helpful teacher making the learning possible. Our idea is to explore another possible explanation, taking into account semantic information attached to the examples. We have developped a syntactico-semantic model of language learning based on the Principle of Compositionality, which precisely describes the link between syntax and semantics. This model has been examplified in the context of various kinds of Categorial Grammars, associated with logical sematics. The main results obtained are :

Decision trees and unlabeled examples :

In many learning problems, labeled examples are rare or expensive while numerous unlabeled and positive examples are available. Thus we address the problem of learning with the help of positive and unlabeled data given a small number of labeled examples. We find both theoretical and empirical arguments showing that learning algorithms can be improved by the use of both unlabeled and positive data. We give theoretical results for the improvement of Statistical Query learning algorithms from positive and unlabeled data. We apply these ideas to tree induction algorithms. We modify the code of c4.5 to get an algorithm which takes as input a set LAB of labeled examples, a set POS of positive examples and a set UNL of unlabeled data and which uses these three sets to construct the decision tree (Decomité et al 99).


List of publications and reports :

International Journal :

International Conferences


Tree Automata :

Rémi Gilleron and Marc Tommasi are also active researchers in the domain of rewriting techniques and software analysis. They work on set constraint resolution and on the theory of rewriting terms. They also took part in the writing of a collective book on tree automata and their applications. Main references on the subject are :