The online shopping market has grown rapidly in the 21st century, and as a result a large number of duplicate products are sold online. Duplicate detection is an important component of aggregating online products, but it is a time-consuming process. In this paper, we focus on reducing the number of candidate duplicate pairs that are passed as input to the Multi-component Similarity Method (MSM), a state-of-the-art duplicate detection solution. To find the candidate pairs, Locality Sensitive Hashing (LSH) is employed. A previously proposed LSH-based algorithm makes use of binary vectors based on the model words in the product titles. This paper proposes several extensions to that algorithm: it performs advanced data cleaning and additionally uses information from the key-value pairs. Compared to MSM, the MSMP+ method proposed in this paper leads to a minor reduction of 6% in the F1-measure while reducing the number of needed computations by 95%.
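As an illustration of the candidate-selection idea summarized above, the following is a minimal sketch and not the MSMP+ implementation. It assumes that a model word is any title token containing both letters and digits, represents each product by its set of model words (the non-zero entries of a binary vector), and applies min-hashing with banding so that only products sharing at least one band become candidate pairs. The function names, the regular expression, and the parameters `n_hashes`, `bands`, and `seed` are hypothetical; the data cleaning and the key-value pair information described in the abstract are left out.

```python
import random
import re
from collections import defaultdict
from itertools import combinations

TOKEN = re.compile(r"[a-z0-9]+")
PRIME = 2_147_483_647  # modulus for the hash family (assumption)


def model_words(title):
    """Return title tokens that mix letters and digits (simplified definition)."""
    return {
        t for t in TOKEN.findall(title.lower())
        if any(c.isdigit() for c in t) and any(c.isalpha() for c in t)
    }


def minhash_signature(word_ids, hash_params):
    """One minimum hash value per hash function over the product's word ids."""
    return tuple(
        min((a * w + b) % PRIME for w in word_ids) for a, b in hash_params
    )


def candidate_pairs(titles, n_hashes=20, bands=10, seed=0):
    """Return index pairs of products that share at least one LSH band."""
    rng = random.Random(seed)
    hash_params = [
        (rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n_hashes)
    ]
    word_sets = [model_words(t) for t in titles]
    # Vocabulary index: position of each model word in the binary vector.
    vocab = {w: i for i, w in enumerate(sorted(set().union(*word_sets)))}
    rows = n_hashes // bands
    buckets = defaultdict(set)
    for idx, words in enumerate(word_sets):
        if not words:
            continue  # products without model words produce no candidates here
        sig = minhash_signature([vocab[w] for w in words], hash_params)
        for band in range(bands):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then be handed to a more expensive similarity method such as MSM. Using more bands with fewer rows per band yields more candidates (fewer missed duplicates, more computations), and fewer bands with more rows does the opposite.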

doi.org/10.1007/978-3-319-91563-0_25, hdl.handle.net/1765/108863
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Erasmus University Rotterdam

Hartveld, A. (Aron), van Keulen, M. (Max), Mathol, D. (Diederik), van Noort, T. (Thomas), Plaatsman, T. (Thomas), Frasincar, F., & Schouten, K. (2018). An LSH-based model-words-driven product duplicate detection method. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). doi:10.1007/978-3-319-91563-0_25