Dataset formats

In each dataset, all documents are annotated.

There are two kinds of annotations:

  • slot instances
  • tuple instances

A slot instance is a leaf node. Its parent carries the special attribute slot, whose value is the name of the slot. A tuple instance is a tuple of leaf nodes. The special attribute tuple gives the tuple instance to which a slot instance belongs.
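
The annotation scheme can be illustrated with a minimal sketch. It assumes that the leaf node is the text content of an element, that the tuple attribute sits on the same parent element as the slot attribute, and that its value is an identifier shared by all slot instances of one tuple. The sample fragment, its attribute values, and the function name extract_tuples are hypothetical and are not taken from the datasets themselves.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical annotated fragment in the style of bigbook (name,address).
# The tuple identifiers "1" and "2" are illustrative assumptions.
SAMPLE = """<table>
  <tr>
    <td slot="name" tuple="1">Alice Diner</td>
    <td slot="address" tuple="1">12 Main St</td>
  </tr>
  <tr>
    <td slot="name" tuple="2">Bob Grill</td>
    <td slot="address" tuple="2">34 Oak Ave</td>
  </tr>
</table>"""


def extract_tuples(xhtml, relation):
    """Group slot instances by their tuple attribute.

    Returns one Python tuple per tuple instance, ordered by the slot
    names in `relation`, in document order. A slot missing from a
    tuple instance is reported as None.
    """
    root = ET.fromstring(xhtml)
    grouped = defaultdict(dict)  # tuple id -> {slot name: leaf text}
    for elem in root.iter():
        slot = elem.get("slot")
        tuple_id = elem.get("tuple")
        if slot is None or tuple_id is None:
            continue  # element carries no annotation
        grouped[tuple_id][slot] = (elem.text or "").strip()
    return [tuple(slots.get(name) for name in relation)
            for slots in grouped.values()]


rows = extract_tuples(SAMPLE, ("name", "address"))
```

Here rows holds one (name, address) pair per tuple instance. Real documents may place the annotations differently (for example with namespaces or on nested elements), so the attribute lookup would need adapting to the actual files.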


Annotated XHTML versions of bigbook, okra, iaf, s20 and s1

Dataset  n-ary relation
bigbook  (name,address)
okra     (name,mail,score)
s1       (product,company,price)
s20      (file,size,type,score)
iaf      (name,mail,lastupdate,organization,altname,provider)


Annotated XHTML benchmarks with different layouts

Dataset   n-ary relation
L0 to L9  (club,season,round,date)
CT        (club1,club2,wins)

Benchmarks from real web sites

Annotated XHTML benchmarks

Dataset         n-ary relation                Web Site
BLS             (year,quantile,value)         Bureau of Labor Statistics
Excite Weather  (town,day,forecast,high,low)  Excite Weather