
Jérôme Champavère – PhD in Computer Science



PhD Thesis

Title. Schema-Guided Query Induction.

Abstract. XML is a generic data description language originally designed for storing, processing, and exchanging information on the Internet. It has emerged as a standard for the database, document, and Web communities, and it is now used in numerous applications. The data format processed by these applications is usually specified by an XML schema, a meta-description that constrains the structure of XML documents and the types of their data.

Querying documents in order to extract the information they contain is an essential task. Node-selecting queries, for instance, are the basis for transforming XML documents. However, most existing tools for defining queries over XML documents require technical skills from the user. In contrast, inductive query learning is a way of designing information extraction tasks without any prior knowledge. In such a system, the user annotates some example documents through a graphical interface, and a learning algorithm then infers the query from these annotations.
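As a small illustration of what a node-selecting query is, the sketch below uses Python's standard `xml.etree.ElementTree` module to select nodes from a document. The document, its element names, and the helper `select_book_titles` are hypothetical and chosen only for illustration. The query is written by hand here, whereas in the inductive setting it would be inferred from the nodes the user annotates.

```python
import xml.etree.ElementTree as ET

# Hypothetical example document; the element names are illustrative only.
DOC = """
<library>
  <book><title>Tree Automata</title><year>2007</year></book>
  <book><title>Grammatical Inference</title><year>2010</year></book>
  <journal><title>Some Journal</title></journal>
</library>
"""

def select_book_titles(root):
    """Node-selecting query: the title elements that are children of a book."""
    return root.findall("./book/title")

root = ET.fromstring(DOC)
selected = select_book_titles(root)      # a list of selected nodes
titles = [node.text for node in selected]
print(titles)
```

The query returns nodes of the document rather than a yes/no answer, which is what makes it usable as a building block for transformations. The journal title is not selected because its parent is not a `book` element.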

In this thesis, we propose to exploit the knowledge provided by XML schemas in query induction algorithms based on grammatical inference techniques. Since schemas define regular tree languages, they can easily be represented by tree automata, which makes them especially well suited to automata inference algorithms. We distinguish two ways of using them.

  1. The first idea is to force inferred queries to be consistent with the schema. For this purpose, we have designed an efficient inclusion test for deterministic factorized tree automata, a model of automata we have defined in order to represent DTDs in a compact manner.
  2. The second idea is that the information contained in XML schemas can be useful for tree pruning heuristics. Pruning is necessary when the processed documents are very large and/or only partially annotated. The drawback is that some regular queries can no longer be inferred. We give a characterization of the class of queries that can be learned from a sample of pruned annotated trees, namely stable queries.
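To make the idea of a schema as a tree acceptor more concrete, here is a minimal sketch of a DTD-like schema checked bottom-up over a tree. It is not the factorized tree automata model of the thesis, nor its inclusion test. It is a simplified stand-in in which each label maps to a regular expression over the labels of its children. The schema `SCHEMA`, the tree encoding as `(label, children)` pairs, and the function `is_valid` are all assumptions made for this example.

```python
import re

# Hypothetical DTD-like schema: each label maps to a regular expression over
# the sequence of its children's labels (every label is followed by a space).
SCHEMA = {
    "library": r"(book |journal )*",
    "book": r"title (year )?",
    "journal": r"title ",
    "title": r"",
    "year": r"",
}

def is_valid(tree, schema=SCHEMA):
    """Check that every node's child sequence matches its schema rule."""
    label, children = tree
    if label not in schema:
        return False
    # Validate the subtrees first, then the node itself (bottom-up).
    if not all(is_valid(child, schema) for child in children):
        return False
    child_word = "".join(child_label + " " for child_label, _ in children)
    return re.fullmatch(schema[label], child_word) is not None

# A tree is a pair (label, list of child trees).
good = ("library", [
    ("book", [("title", []), ("year", [])]),
    ("journal", [("title", [])]),
])
bad = ("library", [("book", [("year", [])])])  # a book without a title

print(is_valid(good), is_valid(bad))
```

A query that can only select nodes in trees accepted by such a checker is consistent with the schema in the informal sense used above. Testing that property for inferred queries requires an inclusion test between automata, which this sketch does not attempt.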

We have implemented and tested our schema-guided query induction algorithms. The system we developed makes it possible to simulate the user's behavior when defining a new query. The results of our experiments support the relevance of our approach: they show that schema guidance improves the learning process.

Keywords: queries, XML schemas, grammatical inference, trees, automata.



International Journals

International Conferences

International Workshops



Attended Events

Conferences and Workshops

Summer Schools

Last modified: 11 May 2013, 20:31.