PLEX86  x86- Virtual Machine (VM) Program
 CVS  |  Mailing List  |  Download  |  Newsgroups

String hashing was: Thou shalt have no other gods before the ANSI C standard 1543


Your Ad Here

Your Ad Here

Randy Howard

Yeah but rather than forcing all the other hashes to notice '-0' termination, I chose instead to force this hash function to obey length termination. I don't think it makes a really huge difference -- in fact its probably good that we did it in two different ways so nobody can claim that reason as an excuse for its poor performance.

String hashing was: Thou shalt have no other gods before the ANSI C standard
snip ... I just downloaded your test program, and I see various difficulties. First, it is sadly (and unnecessarily) lacking in portability, using ints...

Yes, I saw it. I think the point of using this 31-33 hash (which I think is due to Chris Torek, but I might be mistaken) is to *only* use the last few bits as a table index. The low value of "33" or "31" means that it has all the good properties of just packing in text characters of a single case 5 bits at a time into the output. So even up to 6 characters worth of data (so long it is all, say, lowercase text) will basically never collide in all 32bits of the hash output. So there is *something* to be said for using this hash function for some kind of inputs.

But this property has to be put into perspective. A simple thing like mixing upper, lower case, punctuation and using numbers (like you might see in a street address), and all of a sudden this hash function will start looking really bad for collisions even for sequences of less than 6 characters (or where the difference between two strings appears only in the first 6 characters). If we use 33 as the multiplier then h("BA") == h("Ab") -- now that's pretty pathetic. Compare this to SFH, Bob Jenkins, One-at-a-time, CRCs or other serious hash functions -- all produce exactly 0 collisions so long as the number of input bits is less than the output bits (i.e., usually up to 4 bytes of input). I don't know if FNV has this property (I can't tell from its construction) and CRCs are weak in that they don't avalanche as well.

With the 33-31 hash function finding these collisions is a 30 second exercise. With SFH, FNV, MD4, etc, you have to work a lot harder to find these collisions (you basically have to scan for them using some brute force mechanism), otherwise two different inputs from any typically distributed data stream will always have 1-2^b probability of colliding, where b is the number of bits used from the outputs hash (and can be as high as 32.)

String hashing was: Thou shalt have no other gods before the ANSI C standard
Tim Peters ... Randy Howard I don't know what metric(s) you're using, so "stellar" doesn't mean anything to me here. ... OK, to be tedious, it's "quite fast" among the clbutt of string hashing...

But if SFH has fewer collisions on most clbuttes of input data, and excutes a lot faster, its hard to make a credible argument for this 31-33 hash function.

String hashing was: Thou shalt have no other gods before the ANSI C standard 1544
difference. The other way just seemed easier to describe for others and was potentially less invasive to what CBF posted. I didn't go in search of collisions, I just picked a random medium length strength...

-- Paul Hsieh



Your Ad Here

List | Previous | Next

String hashing was: Thou shalt have no other gods before the ANSI C standard 1544

Alt Folklore Computers from Newsgroups

The #1 Usenet Provider on the Internet

String hashing