As the number of Web shops grows, aggregating product data from the Web is becoming increasingly important. One of the problems encountered in product aggregation is duplicate detection. In this paper, we extend and significantly improve an existing state-of-the-art product duplicate detection method. Our approach employs a novel method for combining the titles' and the attributes' similarities into a final product similarity. We use q-grams to handle partial matching of words, such as abbreviations. Where existing methods cluster products of only two Web shops, we propose a hierarchical clustering method to handle multiple Web shops. Applying our new method to a dataset of TVs from four Web shops reveals that it significantly outperforms the Hybrid Similarity Method, the Title Model Words Method, and the well-known TF-IDF method, with an F1 score of 0.475 compared to 0.287, 0.298, and 0.335, respectively.
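To illustrate how q-grams allow partial matching of words, the following is a minimal sketch rather than the paper's actual implementation. It assumes character q-grams with q = 3 and a Jaccard overlap of the two q-gram sets; the function names `qgrams` and `qgram_similarity`, the choice of q, and the use of Jaccard overlap are all assumptions made for illustration, and the similarity measure used in the paper may differ.

```python
def qgrams(text, q=3):
    """Return the set of character q-grams of a lowercased string.

    Strings shorter than q are kept as a single token (an assumption
    made here so that short words still contribute something).
    """
    s = text.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a, b, q=3):
    """Jaccard overlap of the two q-gram sets, a value in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two empty strings are treated as identical
    return len(ga & gb) / len(ga | gb)


# "inch" and "inches" share the q-grams "inc" and "nch",
# so they match partially instead of being treated as different words.
print(qgram_similarity("inch", "inches"))
```

Exact word comparison would score "inch" against "inches" as a mismatch, whereas the q-gram overlap gives the pair a non-zero similarity, which is the kind of partial match the abstract refers to.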

dx.doi.org/10.1145/2695664.2695818, hdl.handle.net/1765/86179
Erasmus University Rotterdam

Van Bezu, R, Borst, S, Rijkse, R, Verhagen, J, Vandic, D, & Frasincar, F. (2015). Multi-component similarity method for web product duplicate detection. doi:10.1145/2695664.2695818