Identifying duplicate web pages is a nice practical problem. Obviously, two pages are considered duplicates ('the same') if they carry the same content. The trouble is that from the computer's point of view, pages are the same only if they match byte for byte, whereas from the user's point of view they are the same if they carry the same content, ignoring irrelevant parts such as advertisements and embedded RSS feeds.
Since we are not interested in how exactly the documents differ (as in plagiarism detection, for example), it is beneficial to work only with hash values of the page contents, called fingerprints. For exact-duplicate matching, a simple comparison of fingerprints suffices, so any common hash function is suitable. For near-duplicate matching, in contrast, the hash function has to map similar documents to close fingerprints while still distributing documents evenly.
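As a small illustration of the exact-duplicate case, a minimal Python sketch (the choice of SHA-1 and the sample strings are my own assumptions, not anything prescribed above):

```python
import hashlib

def fingerprint(text):
    # Fingerprint for exact-duplicate matching: any common hash function will do.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

page_a = "The quick brown fox jumps over the lazy dog."
page_b = "The quick brown fox jumps over the lazy dog."

# Exact duplicates have identical fingerprints; a single changed byte breaks the match.
print(fingerprint(page_a) == fingerprint(page_b))  # True
```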
One hash function with this property is Charikar's hash [1] (my favourite), also referred to as simhash [2]. It works well even for small (64-bit) fingerprints; see the evaluations in [2,3].
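How "close together" is measured is not spelled out above; for simhash fingerprints it is usually the Hamming distance (the number of differing bits), so take that as an assumption here. A minimal Python sketch for 64-bit fingerprints:

```python
def hamming_distance(a, b):
    # Number of bit positions in which two 64-bit fingerprints differ.
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")

fp1 = 0x8F3B2C1D4E5F6071   # made-up fingerprint value, for illustration only
fp2 = fp1 ^ 0b101          # flip two bits
print(hamming_distance(fp1, fp2))  # 2
```

In practice a small threshold (a few differing bits out of 64) is then used to decide whether two pages are near-duplicates.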
The calculation of the hash is performed in the following way:

- a vector V of integers is initialized to 0; the length of the vector corresponds to the desired hash size in bits
- each feature of the document is hashed into an ordinary hash value h (using a classic hash function)
- for each feature, the vector V is updated: the i-th element of V is decreased by the feature's weight if the i-th bit of h is 0, and increased by the feature's weight if the i-th bit of h is 1
- finally, the signs of the elements of V correspond to the bits of the final fingerprint

My experimental implementation is available here and you are welcome to use it for your own experiments ;). It uses Jenkins's hash function.
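To make the steps above concrete, here is a minimal Python sketch of the procedure (not the experimental implementation mentioned above); it assumes unit-weight word features and uses Python's built-in hash in place of Jenkins's hash:

```python
def simhash(features, hashbits=64):
    # Charikar's simhash of an iterable of (feature, weight) pairs.
    v = [0] * hashbits                   # the vector V, one counter per output bit
    mask = (1 << hashbits) - 1
    for feature, weight in features:
        # Ordinary hash of the feature, truncated to hashbits.
        # Note: Python's built-in hash is salted per process; a stable hash
        # (e.g. Jenkins's) would be used in a real implementation.
        h = hash(feature) & mask
        for i in range(hashbits):
            if h & (1 << i):             # i-th bit of h is 1 -> increase V[i]
                v[i] += weight
            else:                        # i-th bit of h is 0 -> decrease V[i]
                v[i] -= weight
    fingerprint = 0
    for i in range(hashbits):
        if v[i] > 0:                     # the sign of V[i] becomes bit i of the fingerprint
            fingerprint |= 1 << i
    return fingerprint

# Example: unit-weight word features of a page's textual content.
doc = "the quick brown fox jumps over the lazy dog"
print(hex(simhash((word, 1) for word in doc.split())))
```

Two pages would then be compared by the Hamming distance of their fingerprints, as in the earlier sketch.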