Detection of near-duplicate documents

Similarity

Identifying duplicate web pages is a nice practical problem. Obviously, two pages are considered duplicates ('the same') if they carry the same content. The catch is that from the computer's point of view, pages are the same only if they match byte-by-byte, whereas from the user's point of view, pages are the same if they carry the same relevant content; irrelevant parts such as advertisements or embedded RSS feeds are ignored.

Since we are not interested in how exactly the documents differ (as in plagiarism detection, for example), it is beneficial to work only with hash values of the page contents, so-called fingerprints. For exact-duplicate detection a simple match of fingerprints suffices, so any common hash function is suitable. For near-duplicate detection, the hash function must still distribute documents evenly, but at the same time it has to map similar documents to fingerprints that lie close together.
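
To make the difference concrete, below is a minimal sketch in Python of the two matching modes. It assumes that 64-bit integer fingerprints have already been computed; the threshold k = 3 is the value found to work well for 64-bit fingerprints in [2].

    def is_exact_duplicate(fp_a: int, fp_b: int) -> bool:
        # Exact duplicates: with any common hash function,
        # identical content yields identical fingerprints.
        return fp_a == fp_b

    def is_near_duplicate(fp_a: int, fp_b: int, k: int = 3) -> bool:
        # Near duplicates: with a similarity-preserving hash,
        # similar documents differ in only a few bit positions,
        # so we test whether the Hamming distance is at most k.
        return bin(fp_a ^ fp_b).count("1") <= k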

Charikar's hash

One such function is Charikar's hash [1] (my favourite), also referred to as simhash [2]. It works well even for small (64-bit) fingerprints; see the evaluations in [2,3].

The calculation of the hash is performed in the following way:

  1. Split the document into a set of features (typically words or shingles), each optionally carrying a weight.
  2. Hash every feature with a conventional hash function into a b-bit value (b = 64 here).
  3. Keep a vector of b counters, initially zero. For every feature hash, walk through its bits: if bit i is set, add the feature's weight to counter i, otherwise subtract it.
  4. Set bit i of the final fingerprint to 1 if counter i is positive, and to 0 otherwise.

Since similar documents share most of their features, most counters end up with the same sign, and the resulting fingerprints differ only in a few bit positions; the Hamming distance between fingerprints thus measures how much the documents differ.
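
For concreteness, here is a minimal sketch of this computation in Python, assuming whitespace-separated words as unit-weight features; blake2b truncated to 64 bits merely stands in for the Jenkins hash used in the downloadable implementation below.

    import hashlib

    def feature_hash(token: str) -> int:
        # Any conventional 64-bit hash will do; blake2b truncated to
        # 8 bytes substitutes for the Jenkins hash here.
        digest = hashlib.blake2b(token.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def simhash(text: str, bits: int = 64) -> int:
        # One counter per bit position of the fingerprint.
        v = [0] * bits
        for token in text.split():
            h = feature_hash(token)
            for i in range(bits):
                # A set bit votes +1 for that position, a clear bit votes -1.
                v[i] += 1 if (h >> i) & 1 else -1
        # Positive counters become the 1-bits of the fingerprint.
        fingerprint = 0
        for i in range(bits):
            if v[i] > 0:
                fingerprint |= 1 << i
        return fingerprint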

Download

My experimental implementation is available here and you are welcome to use it for your own experiments ;). It uses the Jenkins hash function.

Source

References

  1. Charikar: Similarity Estimation Techniques from Rounding Algorithms, in Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, ACM Press, 2002
  2. Manku, Jain, Sarma: Detecting Near-Duplicates for Web Crawling, in Proceedings of the 16th International Conference on World Wide Web, ACM Press, 2007
  3. Henzinger: Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms, in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 2006

Viliam Holub