bloom filters using SHA1 in haskell

I recently explored an idea I read about elsewhere on the web - using SHA1 hashes to form the basis of a set of bloom filter hashes.

This works as follows: given (for example), an input string for a bloom filter, derive the SHA1 hash. this result is 160 bits long. So contained within are five 32-bit values, which we can convert to integers. These integers can be used to form the offsets inside a bloom filter data structure. As long as the collision rate for SHA1 is no worse than that for the bloom filter itself, this should work. Below I provide an example for the words in the local dictionary file, which are utf8 encoded strings.

Note that while I feel the sample below may be of interest, there is already a well-written bloom filter package on hackage that appears to be very well tested and thorough, so I have no intention of uploading this and creating confusion by suggesting this naive implementation as an alternative.

gist

last update 2012-06-29