GRAPPA : Groupe de Recherche sur l'Apprentissage Automatique
Research themes
-
Grappa is the machine learning research group of the computer science
department at the Universities of Lille. Grappa is funded jointly by
the French research ministry, INRIA, and CNRS.
- The group is now composed of two teams. Please see their Web pages to get up-to-date informations on GRAppA:
MOSTRARE or SEQUEL
The current research topics are:
- machine learning for tree structured data and documents:
The aim of Mostrare is to integrate tree structures and emerging
machine learning techniques into adaptive information extraction
systems. We develop machine learning algorithms adapted to tree structured data. We wish to develop automatic or semi-automatic
procedures for inducing tree transformations. We would integrate
these algorithms into innovative information extraction
systems, semantic Web platforms, and document processing
engines. We use grammatical inference techniques and statistical learning techniques.
- reinforcement learning:
SequeL aims at studying sequential Learning. We consider various tasks of machine learning, namely in supervised learning (regression, functional prediction), in unsupervised learning (clustering, and functional clustering), and in reinforcement learning (sequential decision processes).
Our basic tool is statistical learning. We consider mainly:
- non parametric sparse function approximation
- non parametric Bayesian methods
Past research topics
Rémi Gilleron and Marc Tommasi were also active researchers in the domain of
rewriting techniques and software analysis. They took part in the writing of
a collective book on tree automata and their applications (see TATA).
Learning from positive and unlabeled data
In many machine learning settings, labeled examples are difficult to
collect while unlabeled data or examples from some particular class
(that we call the positive class) are abundant. Can these
additional data be used to improve accuracy of supervised learning
algorithm? We investigate the design of learning
algorithms with the help of positive data and unlabeled data. Many machine
learning and data mining algorithms, such as decision tree inference
algorithms, only use labeled examples in order to evaluate statistical
queries (SQ-like algorithms). Kearns designs the SQ learning model
in order to describe these algorithms. In (Decomite & al), we
have given evidence -- with both theoretical and empirical arguments
-- that positive data and unlabeled data can boost accuracy of SQ-like
learning algorithms. It was noticed that learning with positive and
unlabeled data is possible as soon as the weight of the target concept
(i.e. the ratio of positive examples) is known by the learner. An
estimate of the weight can be obtained either by an extra-oracle (say
for a similar problem) or from a small set of labeled examples.
In (Letouzey & al 00), we have considered the more general problem where only
positive data and unlabeled data are available. We present a general
scheme which transforms any SQ-like supervised learning algorithm L
into an algorithm PL using only positive data and unlabeled data. We
prove that PL is a learning algorithm as soon as the learner is
provided with a lower bound on the weight of the target concept. It
remains open whether it is possible to design an algorithm from
positive data and unlabeled data from any SQ learning algorithm in the
general case.
We design text classification algorithms from positive and unlabeled data.
More precisely, we design a naive Bayes algorithm from positive and unlabeled data.
We also extend the co-training framework.
Grammatical Inference
The subject of grammatical inference is the inference of formal languages represented by grammars or automata from example. Areas of application include all cases where formal language can describe a practical situation : natural language processing, voice recognition, text mining, pattern recognition in DNA...
Our work deals mainly with the inference of regular languages. This class of language is the simplest in Chomsky hierarchy, but there are still a lot of difficulties to overcome from the machine learning point of view; as regular languages can be non trivial approximation of real situation, and as every work on most complex class of language imply that this level of language has to be well understood, many works of grammatical inference deals on this thema.
We deal with this problem from two main axis :
- representation of language. In most works, regular languages are represented only using deterministic automata. One of the consequences of this choice is that it excludes languages for which this representation is complex. We defined a class of non deterministic automata (residual finite states automata) that allows to enlarge the class of languages efficiently learnable.
- learning from positives data only. An hypothesis in natural language learning is that the learning process has to be able to work with syntaxically correct sentences only. But classical learning model can't take that into account. We study the hypothesis that the distribution behind the language contains information on both the syntax and the semantic of sentences.
Natural language learning
Natural language learning is a universal human process, based on positive
data only (i.e. syntactically correct sentences). But, from a theoretical
point of view, natural languages are at least context-free and this class
is not learnable from positive data in usual learning models. To overcome
this paradox, various solutions have been proposed : for example, following
the chomskian intuition, it can be admitted that natural languages belong
to a restricted family whose structural properties are innately known by
the human mind, or that the examples are provided by a helpful teacher making
the learning possible. Our idea is to explore another possible explanation,
taking into account semantic information attached to the examples.
We have developed a syntactico-semantic model of language learning based
on the Principle of Compositionality, which precisely describes the link
between syntax and semantics. This model has been exemplified in the context
of various kinds of Categorial Grammars, associated with logical semantics.
The main results obtained are :
-
Syntactic analysis with categorial grammars : we have proposed a new
algorithm for the syntactic analysis with categorial grammars, and study
its complexity (Finkel & Tellier 96).
-
Syntactico-semantic learning : We have defined a new learning model
in which both syntax and semantic information are used. This model can be
adapted to special cases of categorial grammars (Tellier 98a, 98b, 98c) or
to any linguistic formalism satisfying the Principle of Compositionality
(Tellier 99). This model gave rise to a software implementation to be still
improved, taking into account the heuristics inspired by the formal models
of learning from simple examples and from positive examples defined in previous
works (see below).
Data Mining in medicine
The Grappa team is involved in two research projects in data mining in medicine :
cardiovascular risk and diabetes.
- diabetes:
We are involved in a KDD (Knowledge Discovery in Databases) project. A
historical set of records describes diabetic patients. The task is to
predict complications. In the Data Mining
phase, we must find an appropriate machine learning method that
satisfies two constraints:
- a diabetic patient may have either no complication, or one complication,
or more than one complication. Therefore our problem is a multi-label
classification problem in which each example is associated with a
possibly empty set of labels;
- the readability of the induced procedure is an important
property for medical problems. Therefore, the outcome of our
algorithm must be intelligible.
We have extended the alternating decision tree
learning algorithm of (Freund and Mason 99) to the case of
multi-label problems and have applied the algorithm to the diabetes data.
See (Decomite & al 01).
- Cardiovascular risk: the project starts.
Algorithmic learning theory
In Machine Learning, heuristics are usually justified in terms of algorithmic
complexity (principle of Occam Razor, Minimal Description Length, minimization
of the empirical risk). On the other hand, in natural learning, teachers seem
to give a higher priority to simple or highly informative examples. The
Kolmogorov complexity is a nearly perfect measure of the algorithmic complexity
of an object and the Solomonoff-Levin distribution gives greater importance
to objects simple relatively to this measure, (i.e. associated with a low
Kolmogorov complexity). Our purpose is to provide a safe basis for heuristics
used in natural and machine learning by making use of this complexity notion,
and to define formal models of learning in which they are theoretically
justified. Our main results are the following :
-
Definition of formal models : We have defined two extensions of the
Probably Approximately Correct (PAC) model of Valiant. The first one is the
learning model from simple examples, in which the examples are provided according
to the Solomonoff-Levin distribution relatively to the target (Denis et al
96). The second one is the learning model with a helpful teacher, in which the
only admitted distributions of examples are the ones assigning a non
null probability to the members of a characteristic sample set associated
with the target (Denis & Gilleron 97), (Denis & al 97), (Denis & Gilleron 01)
-
Learning from positive examples : In natural situations, it is often
noticed that the learner has to content himself with positive data. In this
context, it is hard to explain how human learners avoid over-generalization.
We have defined an extension of the Valiant's PAC model dedicated to the
learning from positive examples. In this model, the learner receives both
positive and unlabeled examples from an oracle, and uses the unlabeled data
to infer information about the underlying distribution. This model can also
be extended to learning by queries (Denis 98) and is applied in the domain
of learning by decision trees described further.
(Dernière mise à jour de ce descriptif : 25/08/2010.)
Responsable de ces
pages : D.Gonzalez (dominique point gonzalez à univ-lille3 point fr)