Arizona State University - Cognitive Information Processing Lab, Department of CSE

DataRover: Enabling Automated Data Extraction And Integration Of Taxonomy-Based Data-Intensive Web Sites

 

Team

Hasan Davulcu

Saravanakumar Nagarajan

Viswanathan Ramachandran

Dipti Aswath

Chiranjeevi Jaladi

 

Abstract

The advent of e-commerce has brought thousands of catalogs online. Most of these Web sites are “taxonomy-directed” and “data-intensive”. A Web site is said to be “taxonomy-directed” if it contains at least one taxonomy for organizing its contents. A “data-intensive” Web site presents its taxonomies and instances in a regular fashion. This paper describes the DataRover system, which can automatically crawl and extract all products from taxonomy-directed, data-intensive online catalogs. DataRover is based on pattern mining algorithms and domain-specific heuristics that exploit navigational and presentation regularities to identify taxonomy, list-of-product, and single-product segments within an online catalog. It then uses the inferred patterns to extract data from all such segments and to automatically turn an online catalog into a database of categorized products. We also provide experimental results to demonstrate the efficacy of DataRover.

 

The DataRover system can automatically crawl and extract products from taxonomy-directed, data-intensive online catalogs. DataRover is based on pattern mining algorithms and domain-specific heuristics that exploit navigational and presentation regularities.
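
To make the idea of a "presentation regularity" concrete, here is a minimal sketch in Python. It is not DataRover's algorithm, which this page does not specify. It only assumes that links rendered under the same chain of enclosing HTML tags form a candidate regular segment, such as a taxonomy menu or a product list. The names `LinkPathCollector`, `repeated_link_groups`, `min_repeats`, and the sample page are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LinkPathCollector(HTMLParser):
    """Record, for every <a href=...>, the chain of enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []                        # currently open tags
        self.links_by_path = defaultdict(list)  # tag path -> list of hrefs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links_by_path["/".join(self.stack)].append(href)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def repeated_link_groups(html, min_repeats=3):
    """Return tag paths that hold at least `min_repeats` links.

    Illustrative heuristic only: a path that repeats many times is
    treated as a candidate taxonomy or list-of-product segment.
    """
    collector = LinkPathCollector()
    collector.feed(html)
    return {
        path: hrefs
        for path, hrefs in collector.links_by_path.items()
        if len(hrefs) >= min_repeats
    }


# Hypothetical catalog page: a category menu plus one unrelated link.
SAMPLE_PAGE = """
<html><body>
<ul>
  <li><a href="/cat/laptops">Laptops</a></li>
  <li><a href="/cat/phones">Phones</a></li>
  <li><a href="/cat/cameras">Cameras</a></li>
</ul>
<p><a href="/about">About us</a></p>
</body></html>
"""

groups = repeated_link_groups(SAMPLE_PAGE)
for path, hrefs in groups.items():
    print(path, hrefs)
```

On the sample page, the three category links share the path `html/body/ul/li/a` and are grouped together, while the single "About us" link is dropped because its path does not repeat. A real system would need far more robust pattern mining than an exact tag-path match.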

 

Computer Science and Engineering

Ira A. Fulton School of Engineering