A hybrid model words-driven approach for Web product duplicate detection

de Bakker, Jacques; Frasincar, Flavius; Vandic, Damir

doi:10.1007/978-3-642-38709-8_10

The detection of product duplicates is one of the challenges that Web shop aggregators are currently facing. In this paper, we focus on solving the problem of product duplicate detection on the Web. Our proposed method extends a state-of-the-art solution that uses the model words in product titles to find duplicate products. First, we employ the aforementioned algorithm in order to find matching product titles. If no matching title is found, our method continues by computing similarities between the two product descriptions. These similarities are based on the product attribute keys and on the product attribute values. Furthermore, instead of only extracting model words from the title, our method also extracts model words from the product attribute values. Based on our experimental results on real-world data gathered from two existing Web shops, we show that the proposed method, in terms of F1-measure, significantly outperforms the existing state-of-the-art title model words method and the well-known TF-IDF method.
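The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The definition of a model word (a token mixing letters and digits), the use of Jaccard similarity, the equal weighting of key and value similarity, the thresholds, and all function names are assumptions made for illustration only.

```python
import re


def model_words(text):
    """Return tokens that mix letters and digits (assumed model-word definition)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        t for t in tokens
        if any(c.isdigit() for c in t) and any(c.isalpha() for c in t)
    }


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_duplicate(p1, p2, title_threshold=0.5, attr_threshold=0.5):
    """Hypothetical two-stage check: titles first, attributes as fallback."""
    # Stage 1: compare the model words found in the two product titles.
    if jaccard(model_words(p1["title"]), model_words(p2["title"])) >= title_threshold:
        return True

    # Stage 2a: similarity between the attribute keys of the two products.
    keys1 = {k.lower() for k in p1["attributes"]}
    keys2 = {k.lower() for k in p2["attributes"]}
    key_sim = jaccard(keys1, keys2)

    # Stage 2b: similarity between model words extracted from attribute values.
    values1 = set().union(*(model_words(v) for v in p1["attributes"].values()))
    values2 = set().union(*(model_words(v) for v in p2["attributes"].values()))
    value_sim = jaccard(values1, values2)

    # Assumed combination rule: unweighted mean of the two similarities.
    return (key_sim + value_sim) / 2 >= attr_threshold
```

In this sketch, two listings whose titles share no model words can still be matched when their attribute keys and the model words in their attribute values overlap strongly, which mirrors the fallback step the abstract describes.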

Additional Metadata
Keywords	attribute distance, entity resolution, model words, products
Persistent URL	doi.org/10.1007/978-3-642-38709-8_10, hdl.handle.net/1765/40749
Journal	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Organisation	Erasmus School of Economics
Citation	de Bakker, J., Frasincar, F., & Vandic, D. (2013). A hybrid model words-driven approach for Web product duplicate detection. Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7908 LNCS, 149–161. https://doi.org/10.1007/978-3-642-38709-8_10
