Multi-component similarity method for web product duplicate detection

Van Bezu, Ronald; Borst, Sjoerd; Rijkse, Rick; Verhagen, Jim; Vandic, Damir; Frasincar, Flavius

doi:10.1145/2695664.2695818

2015-04-13

Due to the growing number of Web shops, aggregating product data from the Web is growing in importance. One of the problems encountered in product aggregation is duplicate detection. In this paper, we extend and significantly improve an existing state-of-the-art product duplicate detection method. Our approach employs a novel method for combining the titles' and the attributes' similarities into a final product similarity. We use q-grams to handle partial matching of words, such as abbreviations. Where existing methods cluster products of only two Web shops, we propose a hierarchical clustering method to handle multiple Web shops. Applying our new method to a dataset of TVs from four Web shops reveals that it significantly outperforms the Hybrid Similarity Method, the Title Model Words Method, and the well-known TF-IDF method, with an F1 score of 0.475 compared to 0.287, 0.298, and 0.335, respectively.
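
As an illustration of how q-grams allow partial matching of words such as abbreviations, the following minimal sketch compares two strings by the overlap of their character q-grams. It rests on assumptions that are not taken from the paper: the function names `qgrams` and `qgram_similarity`, the choice q = 3, and the use of Jaccard overlap as the similarity measure are all hypothetical and may differ from the measure the authors use.

```python
def qgrams(text: str, q: int = 3) -> set:
    """Return the set of character q-grams of a lowercased string.

    Strings shorter than q are kept as a single gram so that short
    tokens can still match (an assumption made for this sketch).
    """
    s = text.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Jaccard overlap of the two q-gram sets, a value in [0, 1].

    This is an illustrative measure, not necessarily the one
    used in the paper.
    """
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Under these assumptions, a truncated word such as "televis" still shares most of its q-grams with "television", so the pair scores well above zero, whereas an exact string comparison would treat the two as different.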

Additional Metadata
Persistent URL	doi.org/10.1145/2695664.2695818, hdl.handle.net/1765/86179
Rights	no subscription
Organisation	Erasmus University Rotterdam
Citation	Van Bezu, R., Borst, S., Rijkse, R., Verhagen, J., Vandic, D., & Frasincar, F. (2015, April 13). Multi-component similarity method for web product duplicate detection. https://doi.org/10.1145/2695664.2695818
