An LSH-based model-words-driven product duplicate detection method

Hartveld, Aron; van Keulen, Max; Mathol, Diederik; van Noort, Thomas; Plaatsman, Thomas; Frasincar, Flavius; Schouten, Kim

doi:10.1007/978-3-319-91563-0_25

Hartveld, A. (Aron), van Keulen, M. (Max), Mathol, D. (Diederik), van Noort, T. (Thomas), Plaatsman, T. (Thomas), F. Frasincar (Flavius) and K.I.M. Schouten (Kim)

2018

The online shopping market is growing rapidly in the 21st century, leading to a large number of duplicate products being sold online. Duplicate detection is an important component for aggregating online products, but it is a time-consuming process. In this paper, we focus on reducing the number of candidate duplicate pairs that are passed as input to the Multi-component Similarity Method (MSM), a state-of-the-art duplicate detection solution. To find the candidate pairs, Locality Sensitive Hashing (LSH) is employed. A previously proposed LSH-based algorithm makes use of binary vectors based on the model words in the product titles. This paper proposes several extensions to that algorithm, performing advanced data cleaning and additionally using information from the key-value pairs. Compared to MSM, the MSMP+ method proposed in this paper leads to a minor reduction of 6% in the F1-measure while reducing the number of required computations by 95%.
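The candidate-selection idea described in the abstract (binary vectors over title model words, min-hashing, then LSH) can be illustrated with a minimal sketch. This is a generic min-hash plus banding example, not the MSMP+ method itself: the model-word definition (a token mixing letters and digits), the MD5-based hash family, the parameter values, and all function names are assumptions made for illustration, and the data cleaning and key-value extensions are not modelled.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def model_words(title):
    """Assumed definition: tokens that contain both a letter and a digit."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return {
        t for t in tokens
        if any(c.isdigit() for c in t) and any(c.isalpha() for c in t)
    }


def _hash(seed, word):
    # Deterministic 64-bit hash; the seed selects one member of the family.
    digest = hashlib.md5(f"{seed}:{word}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(words, num_hashes):
    # Each entry is the minimum hash over the set, which mimics the first
    # non-zero row of the binary vector under a random row permutation.
    return tuple(min(_hash(seed, w) for w in words) for seed in range(num_hashes))


def candidate_pairs(titles, num_hashes=24, rows_per_band=2):
    """Return index pairs that share at least one LSH band bucket."""
    if num_hashes % rows_per_band != 0:
        raise ValueError("num_hashes must be a multiple of rows_per_band")
    buckets = defaultdict(set)
    for idx, title in enumerate(titles):
        words = model_words(title)
        if not words:
            continue  # no model words: this sketch cannot hash the product
        sig = minhash_signature(words, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            buckets[(start, sig[start:start + rows_per_band])].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


titles = [
    "Samsung UN46ES6580 46-Inch 1080p 120Hz LED TV",
    'Samsung 46" UN46ES6580 LED HDTV 1080p 120Hz',
    "LG 55LM7600 55in 240Hz Cinema 3D TV",
]
print(candidate_pairs(titles))  # {(0, 1)}
```

Only the pairs returned here would be handed to a more expensive comparison such as MSM; fewer rows per band makes the filter more permissive, more rows per band makes it stricter.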

Additional Metadata
Keywords	Duplicate detection, Locality sensitive hashing, Min-hashing, Multi-component similarity method, Web shop products
Persistent URL	doi.org/10.1007/978-3-319-91563-0_25, hdl.handle.net/1765/108863
Series	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Rights	no subscription
Organisation	Erasmus University Rotterdam
Citation	Hartveld, A. (Aron), van Keulen, M. (Max), Mathol, D. (Diederik), van Noort, T. (Thomas), Plaatsman, T. (Thomas), Frasincar, F., & Schouten, K. (2018). An LSH-based model-words-driven product duplicate detection method. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). doi:10.1007/978-3-319-91563-0_25
