SQuIRReL: A Firefox Extension for Information Extraction

Presentation

Squirrel (Structural Query Induction relying on Regular Languages) is a Firefox extension that uses machine learning techniques to make it easy to generate information extractors for the web.

Once generated, an information extractor can automatically fetch information from a web site: for example, headlines on a news site, products on a merchant site, or any other specific information in a set of similar pages.

Current state

Squirrel is at an early stage of development and is currently not intended for end users, only for testing and research purposes.

XML/RSS generation is not available, extracted information is only highlighted, and only single elements can be extracted.

Description

The extractor is generated in an interactive way:

  1. The user marks some elements in the web page as selected and, optionally, some others as unselected.
  2. The user launches Squirrel's learning procedure.
  3. Squirrel learns an extractor, tests it on the web page, and highlights all extracted elements.
  4. If the extraction is correct, the extractor can be used on other pages. Otherwise, the user goes back to step 1 and gives more information.
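
The interactive loop above can be illustrated with a toy sketch in Python. This is not Squirrel's learning algorithm: as a stand-in, it assumes each element is described only by its path of tag names from the document root, and it learns the shortest tag-path suffix shared by all selected elements that matches no unselected element. All names (`learn`, `extract`, `matches`, the sample pages) are hypothetical and chosen for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Path = Tuple[str, ...]  # tag names from the root down to the element


def common_suffix(paths: Sequence[Path]) -> Path:
    """Longest tag-path suffix shared by every path."""
    suffix: List[str] = []
    for tags in zip(*(reversed(p) for p in paths)):
        if len(set(tags)) != 1:
            break
        suffix.append(tags[0])
    return tuple(reversed(suffix))


def matches(path: Path, pattern: Path) -> bool:
    """True if the element's tag path ends with the pattern."""
    return len(pattern) > 0 and path[-len(pattern):] == pattern


def learn(selected: Sequence[Path],
          unselected: Sequence[Path] = ()) -> Optional[Path]:
    """Steps 1-3: shortest shared suffix that rejects every unselected path."""
    if not selected:
        return None
    full = common_suffix(selected)
    for k in range(1, len(full) + 1):
        candidate = full[-k:]
        if not any(matches(u, candidate) for u in unselected):
            return candidate
    return None  # no suffix pattern is consistent with the examples


def extract(page: Sequence[Path], pattern: Path) -> List[int]:
    """Indices of the elements that would be highlighted."""
    return [i for i, p in enumerate(page) if matches(p, pattern)]


# A made-up page: two headline links, one navigation link, one paragraph.
page = [
    ("html", "body", "div", "h2", "a"),
    ("html", "body", "div", "h2", "a"),
    ("html", "body", "ul", "li", "a"),
    ("html", "body", "div", "p"),
]

# Round 1: the user selects only the first headline.
round1 = learn([page[0]])
highlighted1 = extract(page, round1)   # too broad: includes the nav link

# Round 2 (step 4, back to step 1): the nav link is marked as unselected.
round2 = learn([page[0]], [page[2]])
highlighted2 = extract(page, round2)

# The corrected extractor is then reused on another, similar page.
other_page = [
    ("html", "body", "div", "h2", "a"),
    ("html", "body", "p"),
]
highlighted_other = extract(other_page, round2)
```

In this sketch the first round over-generalizes to every link, and a single negative example is enough to narrow the pattern to headline links, which mirrors the role of unselected elements in step 1.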

Details of the algorithms are available here.

Installation

Squirrel is a Firefox extension, but it needs an external executable, which is only available for Linux. It requires:

Then you can install:
