GRAPPA : Groupe de Recherche sur l'Apprentissage Automatique

Research themes

The current research topics are:
  1. machine learning for tree structured data and documents: The aim of Mostrare is to integrate tree structures and emerging machine learning techniques into adaptive information extraction systems. We develop machine learning algorithms adapted to tree structured data. We wish to develop automatic or semi-automatic procedures for inducing tree transformations. We would integrate these algorithms into innovative information extraction systems, semantic Web platforms, and document processing engines. We use grammatical inference techniques and statistical learning techniques.
  2. reinforcement learning: SequeL aims at studying sequential Learning. We consider various tasks of machine learning, namely in supervised learning (regression, functional prediction), in unsupervised learning (clustering, and functional clustering), and in reinforcement learning (sequential decision processes). Our basic tool is statistical learning. We consider mainly:
    • non parametric sparse function approximation
    • non parametric Bayesian methods

    Past research topics

    Rémi Gilleron and Marc Tommasi were also active researchers in the domain of rewriting techniques and software analysis. They took part in the writing of a collective book on tree automata and their applications (see TATA).

    Learning from positive and unlabeled data

    In many machine learning settings, labeled examples are difficult to collect while unlabeled data or examples from some particular class (that we call the positive class) are abundant. Can these additional data be used to improve accuracy of supervised learning algorithm? We investigate the design of learning algorithms with the help of positive data and unlabeled data. Many machine learning and data mining algorithms, such as decision tree inference algorithms, only use labeled examples in order to evaluate statistical queries (SQ-like algorithms). Kearns designs the SQ learning model in order to describe these algorithms. In (Decomite & al), we have given evidence -- with both theoretical and empirical arguments -- that positive data and unlabeled data can boost accuracy of SQ-like learning algorithms. It was noticed that learning with positive and unlabeled data is possible as soon as the weight of the target concept (i.e. the ratio of positive examples) is known by the learner. An estimate of the weight can be obtained either by an extra-oracle (say for a similar problem) or from a small set of labeled examples. In (Letouzey & al 00), we have considered the more general problem where only positive data and unlabeled data are available. We present a general scheme which transforms any SQ-like supervised learning algorithm L into an algorithm PL using only positive data and unlabeled data. We prove that PL is a learning algorithm as soon as the learner is provided with a lower bound on the weight of the target concept. It remains open whether it is possible to design an algorithm from positive data and unlabeled data from any SQ learning algorithm in the general case. We design text classification algorithms from positive and unlabeled data. More precisely, we design a naive Bayes algorithm from positive and unlabeled data. We also extend the co-training framework.

    Grammatical Inference

    The subject of grammatical inference is the inference of formal languages represented by grammars or automata from example. Areas of application include all cases where formal language can describe a practical situation : natural language processing, voice recognition, text mining, pattern recognition in DNA...

    Our work deals mainly with the inference of regular languages. This class of language is the simplest in Chomsky hierarchy, but there are still a lot of difficulties to overcome from the machine learning point of view; as regular languages can be non trivial approximation of real situation, and as every work on most complex class of language imply that this level of language has to be well understood, many works of grammatical inference deals on this thema.

    We deal with this problem from two main axis :

    • representation of language. In most works, regular languages are represented only using deterministic automata. One of the consequences of this choice is that it excludes languages for which this representation is complex. We defined a class of non deterministic automata (residual finite states automata) that allows to enlarge the class of languages efficiently learnable.
    • learning from positives data only. An hypothesis in natural language learning is that the learning process has to be able to work with syntaxically correct sentences only. But classical learning model can't take that into account. We study the hypothesis that the distribution behind the language contains information on both the syntax and the semantic of sentences.

    Natural language learning

    Natural language learning is a universal human process, based on positive data only (i.e. syntactically correct sentences). But, from a theoretical point of view, natural languages are at least context-free and this class is not learnable from positive data in usual learning models. To overcome this paradox, various solutions have been proposed : for example, following the chomskian intuition, it can be admitted that natural languages belong to a restricted family whose structural properties are innately known by the human mind, or that the examples are provided by a helpful teacher making the learning possible. Our idea is to explore another possible explanation, taking into account semantic information attached to the examples. We have developed a syntactico-semantic model of language learning based on the Principle of Compositionality, which precisely describes the link between syntax and semantics. This model has been exemplified in the context of various kinds of Categorial Grammars, associated with logical semantics. The main results obtained are :
    • Syntactic analysis with categorial grammars : we have proposed a new algorithm for the syntactic analysis with categorial grammars, and study its complexity (Finkel & Tellier 96).
    • Syntactico-semantic learning : We have defined a new learning model in which both syntax and semantic information are used. This model can be adapted to special cases of categorial grammars (Tellier 98a, 98b, 98c) or to any linguistic formalism satisfying the Principle of Compositionality (Tellier 99). This model gave rise to a software implementation to be still improved, taking into account the heuristics inspired by the formal models of learning from simple examples and from positive examples defined in previous works (see below).

    Data Mining in medicine

    The Grappa team is involved in two research projects in data mining in medicine : cardiovascular risk and diabetes.
    • diabetes: We are involved in a KDD (Knowledge Discovery in Databases) project. A historical set of records describes diabetic patients. The task is to predict complications. In the Data Mining phase, we must find an appropriate machine learning method that satisfies two constraints:
      • a diabetic patient may have either no complication, or one complication, or more than one complication. Therefore our problem is a multi-label classification problem in which each example is associated with a possibly empty set of labels;
      • the readability of the induced procedure is an important property for medical problems. Therefore, the outcome of our algorithm must be intelligible.
      We have extended the alternating decision tree learning algorithm of (Freund and Mason 99) to the case of multi-label problems and have applied the algorithm to the diabetes data. See (Decomite & al 01).
    • Cardiovascular risk: the project starts.

    Algorithmic learning theory

    In Machine Learning, heuristics are usually justified in terms of algorithmic complexity (principle of Occam Razor, Minimal Description Length, minimization of the empirical risk). On the other hand, in natural learning, teachers seem to give a higher priority to simple or highly informative examples. The Kolmogorov complexity is a nearly perfect measure of the algorithmic complexity of an object and the Solomonoff-Levin distribution gives greater importance to objects simple relatively to this measure, (i.e. associated with a low Kolmogorov complexity). Our purpose is to provide a safe basis for heuristics used in natural and machine learning by making use of this complexity notion, and to define formal models of learning in which they are theoretically justified. Our main results are the following :
    • Definition of formal models : We have defined two extensions of the Probably Approximately Correct (PAC) model of Valiant. The first one is the learning model from simple examples, in which the examples are provided according to the Solomonoff-Levin distribution relatively to the target (Denis et al 96). The second one is the learning model with a helpful teacher, in which the only admitted distributions of examples are the ones assigning a non null probability to the members of a characteristic sample set associated with the target (Denis & Gilleron 97), (Denis & al 97), (Denis & Gilleron 01)
    • Learning from positive examples : In natural situations, it is often noticed that the learner has to content himself with positive data. In this context, it is hard to explain how human learners avoid over-generalization. We have defined an extension of the Valiant's PAC model dedicated to the learning from positive examples. In this model, the learner receives both positive and unlabeled examples from an oracle, and uses the unlabeled data to infer information about the underlying distribution. This model can also be extended to learning by queries (Denis 98) and is applied in the domain of learning by decision trees described further.

    (Dernière mise à jour de ce descriptif : 25/08/2010.)

    Responsable de ces pages : D.Gonzalez (dominique point gonzalez à univ-lille3 point fr)
    apache   penguin   php Valid HTML 4.01! Valid CSS! Best viewed with any borowser, Optimisé pour tous les
navigateurs