Dataset formats
In each dataset, all documents are annotated.
There are two kinds of annotations:
- slot instances
- tuple instances
A slot instance is a leaf node. Its parent element carries the special attribute slot, whose value is the name of the slot. A tuple instance is a tuple of leaf nodes. The special attribute tuple identifies the tuple instance to which a slot instance belongs.
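As an illustration of how these annotations might be read, the following minimal Python sketch groups slot instances into tuple instances. It rests on several assumptions that the description above does not confirm: that the tuple attribute sits on the same element as the slot attribute, that its value is a simple identifier shared by all slot instances of one tuple, and that each document parses as well-formed XML. The sample fragment, its values, and the helper name extract_tuples are invented for the example and are not taken from the datasets.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical annotated fragment; the real files may place or format
# the "slot" and "tuple" attributes differently.
SAMPLE = """<html><body><table>
<tr><td slot="name" tuple="1">Chez Paul</td><td slot="address" tuple="1">12 Main St</td></tr>
<tr><td slot="name" tuple="2">Luigi's</td><td slot="address" tuple="2">34 Oak Ave</td></tr>
</table></body></html>"""


def extract_tuples(xhtml, relation):
    """Return one Python tuple per tuple instance, ordered like `relation`."""
    root = ET.fromstring(xhtml)
    grouped = defaultdict(dict)  # tuple id -> {slot name: text of the leaf}
    for elem in root.iter():
        slot = elem.get("slot")
        if slot is None:
            continue  # this element does not annotate a slot instance
        # The leaf node is the text child of the annotated element.
        grouped[elem.get("tuple")][slot] = (elem.text or "").strip()
    # A slot with no instance in a given tuple is reported as None.
    return [tuple(values.get(s) for s in relation) for values in grouped.values()]


tuples = extract_tuples(SAMPLE, ("name", "address"))
print(tuples)
```

Reporting absent slots as None is a design choice of this sketch; it keeps every extracted tuple the same length as the n-ary relation even when a slot has no value in a document.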
XHTML RISE
Annotated XHTML versions of bigbook, okra, iaf, s20 and s1.
| Dataset | n-ary relation |
|---|---|
| bigbook | (name,address) |
| okra | (name,mail,score) |
| s1 | (product,company,price) |
| s20 | (file,size,type,score) |
| iaf | (name,mail,lastupdate,organization,altname,provider) |
DataFoot
Annotated XHTML benchmarks with different layouts.
| Dataset | n-ary relation |
|---|---|
| L0 to L9 | (club,season,round,date) |
| CT | (club1,club2,wins) |
- L0 to L9 (without missing values)
- L0 to L9 (with missing values for the slot date)
- CT (cross tables)
- archive
Benchmarks from real web sites
Annotated XHTML benchmarks.
| Dataset | n-ary relation | Web Site |
|---|---|---|
| BLS | (year,quantile,value) | Bureau of Labor Statistics |
| Excite Weather | (town,day,forecast,high,low) | Excite Weather |
