DataRover:
Enabling Automated Data Extraction And Integration Of Taxonomy-Based
Data-Intensive Web Sites
Team
Hasan Davulcu
Saravanakumar Nagarajan
Viswanathan Ramachandran
Dipti Aswath
Chiranjeevi Jaladi
Abstract
The advent of e-commerce has brought thousands of catalogs
online. Most of these Web sites
are “taxonomy-directed” and “data-intensive”. A
Web site is said to be “taxonomy-directed” if it contains at
least one taxonomy for organizing its contents. A
“data-intensive” Web site presents its taxonomies and instances
in a regular fashion. This paper describes the DataRover system, which can
automatically crawl and extract all products from taxonomy-directed,
data-intensive online catalogs. DataRover is based on pattern mining
algorithms and domain-specific heuristics that use navigational
and presentation regularities to identify taxonomy, list-of-product, and
single-product segments within an online catalog. It then uses the
inferred patterns to extract data from all such segments and to automatically
turn an online catalog into a database of categorized products. We also
provide experimental results to demonstrate the efficacy of DataRover.
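As a rough illustration of what "presentation regularity" can mean in practice, the sketch below groups the text nodes of a page by their root-to-leaf tag path and keeps the paths that repeat. This is not DataRover's actual algorithm, which the abstract does not specify; the names `PathCollector`, `repeated_segments`, and `min_repeats`, and the sample page, are all hypothetical and chosen only for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Records the root-to-leaf tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts_by_path = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts_by_path["/".join(self.stack)].append(text)


def repeated_segments(html, min_repeats=3):
    """Return tag paths whose text nodes occur at least min_repeats times.

    Assumption for this sketch: a path that repeats often is a candidate
    list-of-product (or taxonomy) segment.
    """
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return {
        path: texts
        for path, texts in collector.texts_by_path.items()
        if len(texts) >= min_repeats
    }


# Hypothetical catalog page used only to exercise the sketch.
page = """
<html><body>
<h1>Cameras</h1>
<ul>
  <li><a href="/p/1">Camera A</a></li>
  <li><a href="/p/2">Camera B</a></li>
  <li><a href="/p/3">Camera C</a></li>
</ul>
</body></html>
"""
segments = repeated_segments(page)
```

On the sample page, only the path shared by the three product links repeats often enough to be kept, while the one-off heading is dropped. A real system would need more than path frequency, for example the domain-specific heuristics the abstract mentions, to tell taxonomy segments apart from product lists.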