String hashing was: Thou shalt have no other gods before the ANSI C standard 1549
First, int is supposed to be "fast" on a given architecture, so I can see why the performance interest would look there first. Wanting to use it without blowing up on systems with an int smaller than 32 bits (if that is a concern for production use of the hash) is fixed with an #ifdef, so an error will pop out on platforms with int too small. If somebody really needs a hash function for 16-bit systems, I am sure that they can dig one up, or modify one of these to work well there.
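The #ifdef guard might look something like this (a sketch using <limits.h>; the wording of the error message is mine):

```c
#include <limits.h>

/* Refuse to compile on platforms where unsigned int is narrower than
 * 32 bits, rather than silently computing a different hash value. */
#if UINT_MAX < 0xFFFFFFFFUL
#error "This hash function requires unsigned int of at least 32 bits."
#endif
```

On a 16-bit target the preprocessor trips the #error at compile time, which is exactly the "error pops out" behavior described above.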
I saw several things that I would have done differently as well, but as you say, since this is not c.l.c, and it's "just a test fixture", I didn't mention them. For my purposes, if I used SFH in "real code", I'd change the calling interface and the data types, but not the core algorithm.
So, don't get too upset when someone points out such things. size_t for length would be better, if you don't use strlen inside the hash function itself (with the limitations that implies upon the type of data that can be used).
So use a typedef that will get you exactly what you need for that specific variable. I note that they seem to work okay on 64-bit CPUs, although perhaps not getting as much out of the processor as they would if they were tweaked for it.
Actually you can implement your own header to emulate stdint for implementations that don't already support it. There are several freely available (perhaps one from Gwyn IIRC) implementations that serve as a good starting point for your next portability excursion.
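Such an emulation header can be tiny for the one type these hash functions need. A minimal sketch (the typedef name is mine, chosen so it won't clash with a real <stdint.h>; real emulation headers such as the one attributed to Gwyn above cover far more cases):

```c
#include <limits.h>

/* Pick the narrowest standard unsigned type that is at least 32 bits
 * wide, for pre-C99 compilers without <stdint.h>. */
#if UINT_MAX >= 0xFFFFFFFFUL
typedef unsigned int  my_uint32_t;
#elif ULONG_MAX >= 0xFFFFFFFFUL
typedef unsigned long my_uint32_t;
#else
#error "No unsigned type of at least 32 bits available."
#endif
```

Note this gives an at-least-32-bit type, not an exactly-32-bit one, so a hash using it may still need explicit masking with 0xFFFFFFFFUL.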
Actually, when you're tweaking performance, you're often stepping well off of the normal path for portability, but you can do both, it's just a lot more bother. In the long run, none of the above matters about the purpose of this exercise, to understand these hash functions better, but the stuff below gets more interesting...
Good points. It struck me funny when I first saw them taking lengths instead of terminating on '\0', but then I figured they could be used to hash just about anything in their current form, as long as the data set doesn't create unexpected excess collisions.
More importantly, since it doesn't seem to do the effectiveness of the hash any damage to make them more general, I don't see any reason to limit them to '\0'-terminated strings only.
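The pointer-plus-length interface is what buys that generality. As an illustration, here is FNV-1a (one of the hashes in this thread) written over an arbitrary byte range; because it never scans for '\0', the same function hashes C strings, binary blobs, or strings with embedded NULs (the function name and masking are my own choices in this sketch):

```c
#include <stddef.h>

/* 32-bit FNV-1a over an arbitrary byte range.  The constants are the
 * standard FNV offset basis and prime; the mask keeps the result to
 * 32 bits even where unsigned long is wider. */
static unsigned long fnv1a(const void *data, size_t len)
{
    const unsigned char *p = data;
    unsigned long h = 2166136261UL;   /* FNV offset basis */
    while (len--) {
        h ^= *p++;
        h = (h * 16777619UL) & 0xFFFFFFFFUL;  /* FNV prime */
    }
    return h;
}
```

For a C string you'd call fnv1a(s, strlen(s)); for a struct, fnv1a(&x, sizeof x) works too, though padding bytes then participate in the hash, which is something to be aware of.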
You may recall that I made a comment that having them operate on the data length in the test bed was somewhat fishy, and I'd rather see them tested on random length "strings" or "byte streams" in this case to make sure something strange didn't pop out. I would expect a lot of applications to be hashing lots of variable length items, some long some short. Here's a simple example...
You want to write some sort of online adventure game, and you have a bazillion strings to hash. Being a lazy gamer, you want to have only one hash table for it all. :-) Room descriptions, monster names, descriptions, characters, weapons, long descriptions of weapons, special "magic" event descriptions, and a slew of commands, etc. Some of those room descriptions might be upwards of 800 bytes, but then you have a slew of short strings like "n", "kill", "fireball", "sleep", "up", plus a lot of random length stuff in between.
For something like that, you would definitely want to make sure you had a hash with good average performance across both the short and the long. Because of the way it's done, SFH may be slower internally on short strings, but the other issue is of course collisions. Obviously, if everything hashes to a few buckets and leaves the rest empty, the speed of the hash function isn't going to be the issue.
I saw that very thing recently in an app where a programmer had rolled his own hash function, and rolled it poorly, such that in a 1031 element table, he managed to put 80% of the items in the first 16 entries, with a few scattered "hotspots" through the rest of the table, and mostly 0 or 1 items in each of the remaining ones. After instrumenting it (because it looked suspicious just from the code itself) I found this out. Bucket 0 had over 5000 items in its chain, while the average number of strings per bucket was around 10. That's certainly an exceptionally bad case, and I'd like to hope that none of the hash functions being discussed in this test program are anywhere near that bad.
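The instrumentation that exposed that table can be as simple as walking the buckets and summarizing chain lengths. A sketch (the function names and the choice of statistics are mine; in the real app the counts would come from walking each bucket's chain):

```c
#include <stdio.h>
#include <stddef.h>

/* Longest chain in the table -- the 5000-item bucket jumps right out. */
static size_t max_chain(const size_t counts[], size_t nbuckets)
{
    size_t i, max = 0;
    for (i = 0; i < nbuckets; i++)
        if (counts[i] > max)
            max = counts[i];
    return max;
}

/* Print the kind of summary that makes a lopsided distribution obvious:
 * total items, worst chain, empty buckets, and average load. */
static void report_distribution(const size_t counts[], size_t nbuckets)
{
    size_t i, total = 0, empty = 0;
    for (i = 0; i < nbuckets; i++) {
        total += counts[i];
        if (counts[i] == 0)
            empty++;
    }
    printf("items=%lu  max chain=%lu  empty=%lu  avg=%.2f\n",
           (unsigned long)total,
           (unsigned long)max_chain(counts, nbuckets),
           (unsigned long)empty,
           nbuckets ? (double)total / (double)nbuckets : 0.0);
}
```

With a 1031-bucket table and an average load around 10, a max chain over 5000 is all the evidence needed that the hash, not the table size, is the problem.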
Someone asked me elsethread recently why I wanted to "see it for myself" instead of just believing what "everybody else thought". These real world examples are exactly why I feel that way. Plus, I don't think any of the proponents of 31-33 or FNV earlier in thread expected SFH to do so well, or for the collision concerns to come up.
If you follow this part of CBF's argument, and admit that SFH loses on short strings, then he has a point, because you have to make sure that the hash chosen for an application is correct for your data set.
So, being in want of empirical data, I decided to change your fixture to test short strings. I changed BUFFSZ to 16 and tripled NTESTS to 30000000 to make it run longer for hopefully more accurate measurements. I also changed the loop variable in test(hashfn) from an int to a size_t, as NTESTS kept getting increased during this process. I couldn't stop myself. :-) It had no impact upon the measurements BTW.
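The shape of the changed harness would be roughly this (a sketch; the hashfn signature and function names are my assumptions about the fixture, not its actual code):

```c
#include <stdio.h>
#include <time.h>
#include <stddef.h>

#define BUFFSZ 16          /* shortened buffer, as described above */
#define NTESTS 30000000UL  /* tripled iteration count */

/* Each hash gets identical treatment through a function pointer. */
typedef unsigned long (*hashfn)(const char *, size_t);

/* Loop counter is a size_t so NTESTS can keep growing without
 * overflowing an int. */
static double time_hash(hashfn h, const char *buf)
{
    size_t i;
    clock_t start = clock();
    volatile unsigned long sink = 0;  /* keep calls from being optimized away */

    for (i = 0; i < NTESTS; i++)
        sink += h(buf, BUFFSZ);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Trivial stand-in hash, just to exercise the harness. */
static unsigned long length_hash(const char *s, size_t n)
{
    (void)s;
    return (unsigned long)n;
}
```

The volatile sink matters: without it, an optimizer can hoist or delete the hash calls entirely and the timings become fiction.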
SFH was still faster than 31-33 by about 20%, followed by the rest. All of them are reasonably fast in this case (there's just not much work to do), but SFH is much faster than the worst performers, one-at-a-time and Bob Jenkins, in which case it is 2-3X faster.
So, I decided to go even shorter to figure out what CBF was looking at. 31-33 finally came out ahead. Here are the timings with BUFFSZ = 8 and NTESTS = 60000000 ...
CRC32          : 1.923s
oneAtATimeHash : 3.132s
FNVHash        : 2.308s
31-33          : 1.484s
BobJenkins     : 2.68s
SuperFastHash  : 1.648s
I haven't tried to determine anything about collisions yet, but even so, given the more general case where SFH is considerably faster (and it never does worse than second for any fixed length test that I've tried so far), I find it hard to support an argument for 31-33 solely because it works better for strings that are only 8 chars long.
Note that even if the string length is raised to only 12 chars, then 31-33 starts to be outrun by SFH. However, this reminded me of something I noticed early on but forgot about as this went along. SFH has to handle what Paul calls in his comments "end cases", when len & 3 != 0. So, I decided to use BUFFSZ = 15 and 13. Even so, SFH is convincingly faster. Same for 257. So much for that.
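For anyone who hasn't read the SFH source, the "end case" pattern is a word-at-a-time main loop followed by a switch on the 1-3 leftover bytes. A sketch of the shape (the mixing steps here are placeholders, not Paul Hsieh's actual SuperFastHash arithmetic):

```c
#include <stddef.h>

/* Consume the input 4 bytes per pass, then clean up the tail.
 * Loading byte-by-byte sidesteps alignment traps on strict hardware. */
static unsigned long hash_words(const unsigned char *p, size_t len)
{
    unsigned long h = 0;
    size_t rem = len & 3;               /* 0..3 trailing bytes */

    for (len >>= 2; len > 0; len--) {   /* main loop: 4 bytes per pass */
        h += (unsigned long)p[0]        | ((unsigned long)p[1] << 8)
           | ((unsigned long)p[2] << 16) | ((unsigned long)p[3] << 24);
        h ^= h << 11;                   /* placeholder mixing step */
        p += 4;
    }
    switch (rem) {                      /* "end cases": len & 3 != 0 */
    case 3: h ^= (unsigned long)p[2] << 16; /* fall through */
    case 2: h ^= (unsigned long)p[1] << 8;  /* fall through */
    case 1: h ^= p[0]; h ^= h << 11;
    }
    return h & 0xFFFFFFFFUL;
}
```

BUFFSZ values of 15, 13, and 257 all exercise that switch, which is why they were worth trying separately from the multiples of 4.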
It would be more interesting to run much longer tests with random string contents and random length strings and see how they stand up, in terms of both performance and simulating storing them into a fixed-size table (to look at chain length distributions). That's the part of this that needs to be explored further. In terms of straight-line performance, it's pretty clear from the above that SFH has it, at least on x86 family hardware. It would be interesting to see somebody try it on a Sparc, or even a DEC system with that inherently evil bit count. :-)
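Generating that kind of mixed workload is a few lines in the test driver. A sketch (the function name and MAX_LEN are mine; rand() is fine for a test driver even though it would be a poor choice elsewhere):

```c
#include <stdlib.h>
#include <stddef.h>

#define MAX_LEN 800   /* roughly the longest "room description" above */

/* Fill buf with a random-length, random-content key of 1..maxlen
 * bytes and return its length.  Contents may include '\0', which is
 * exactly what a length-taking hash should be able to swallow. */
static size_t make_random_key(unsigned char *buf, size_t maxlen)
{
    size_t len = (size_t)(rand() % (int)maxlen) + 1;
    size_t i;
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)(rand() & 0xFF);
    return len;
}
```

Seeding with srand() once per run keeps the test repeatable, so every hash function sees the identical stream of keys.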
I'd be interested in the results, but it's a bit of an apples and oranges deal I believe. I'd much rather see instrumentation to simulate storing the entries to see how it does with various table sizes for chain lengths rather than watering it down with all the extra overhead, resizing, etc.
Now we're getting to the good part. This I very much want to see. Stress testing of hash functions. Evil data sets. :-)
Or, if you really want all the data, just stop the clock every time you come up for air to read more data from the file. Maybe I'm missing something, but that doesn't seem like too big a deal, you'll just have to use different timing functions. I'm not convinced that clock() is always all that accurate anyway. For this sort of thing on an unloaded system, wall clock time seems as good or better.
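The stop-the-clock idea amounts to accumulating only the hashing intervals and letting the file reads happen off the clock. A sketch (hash_block is a hypothetical stand-in for whichever hash is under test, and the accumulator is my own arrangement):

```c
#include <stdio.h>
#include <time.h>

static clock_t hashing_time = 0;   /* total time spent hashing only */

/* Placeholder hash standing in for the function under test. */
static unsigned long hash_block(const unsigned char *p, size_t n)
{
    unsigned long h = 0;
    while (n--)
        h = h * 31 + *p++;
    return h;
}

/* Wrap just the hash call in clock() readings; the caller refills
 * the buffer from the file between calls, off the clock. */
static void timed_hash(const unsigned char *buf, size_t n)
{
    clock_t t0 = clock();
    (void)hash_block(buf, n);
    hashing_time += clock() - t0;
}
```

One caveat: clock()'s granularity is coarse on many systems, so for very short blocks the per-call intervals can round to zero, which argues for either larger blocks or the wall-clock approach mentioned above.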
-- Randy Howard (2reply remove FOOBAR) "Making it hard to do stupid things often makes it hard to do smart ones too." -- Andrew Koenig