| PLEX86 | ||
|
String hashing was: Thou shalt have no other gods before the ANSI C standard 1550
Thou shalt have no other gods before the ANSI C standard 1552 Oh, a useless game. I am jumping in, I cannot resist. -- "I" is a poor name for a variable. Especially if the program also (that's not strictly necessary, but... Tim Peters ... Thou shalt have no other gods before the ANSI C standard 1553 On Sat, 26 Feb 2005 02:35:33 GMT in alt.folklore.computers, Randy That fact is recognized in the established professions, where you have to article, or the engineering... I wouldn't know, since I haven't used either of those. I don't claim that it's not true.
I've been on the receiving end of hash bug reports against Python since 1991, and am sharing my experience with the hash functions Python has used. It's OK by me if you discount that, but I don't really "get" the strident tone here. String hashing is extremely common in all Python apps, because of the way the interpreter is implemented, and because string-keyed dictionaries (hash tables) are so heavily used in typical Python apps. That's a world of experience to draw on. Good! Python's string hash has been subjected to extensive statistical testing too, of course.
is that, for example, sequences of similar strings like "x0", "x1", "x2", ... produce results like this: ... s = "x" + str(i) ... i = hash(s) ... print i & 0xf, # show the last 4 bits ... 10 11 8 9 14 15 12 13 2 3 or ... print hash(ch) & 0x1f, ... 0 3 2 5 4 7 6 9 8 11 10 13 12 15 14 17 16 19 18 21 20 23 22 25 24 27 No collisions. This is of great practical worth, because Python uses hash tables with indices that small, resolving a collision is significantly more expensive than hitting a key's slot on the first try, and strings often follow patterns exactly that simple. I'm not claiming this is hard to achieve, BTW. Across all possible inputs, of course not. Thou shalt have no other gods before the ANSI C standard 1555 not to mention plain text in various languages other than English. And another that uses the lowercase o for the number... No. Can't comment -- I don't know anything about your hash. If it's as good as you say, and can be coded in 100% portable C (including making no buttumptions about native word size or alignment), and is open source, the Python project may want to use it. The multiplication in Python's current hash is uncomfortably expensive on some platforms (especially on teesny embedded processors that lack HW integer multiply), so a faster hash would be welcome. Ditto. Thou shalt have no other gods before the ANSI C standard 1554 I think you missed some of the bigger fish, as other responses should have made clear. It was an buttignment to learn about hashing, which is used in performance critical code quite often. This... I've never tested a string hash with a better combination of properties for Python's needs than the one Python currently uses. That's why Python uses behavior stopped after the current method was adopted in 1996, there's been no urgency to look for better methods. String hashing was: Thou shalt have no other gods before the ANSI C standard 1551 Tim Peters But if you've never seen the Bob Jenkin's hash before, or the other functions he examined, then you've still only seen... One person flamed on about how good a string hash CRC was, and that's the last time I burned significant time on doing string-hash testing (around 2000 -- and the CRC-based string hash was a comparative disaster in head-to-head tests). buttuming "your hash" is the one named SuperFastHash on I don't think FNV's one-line inner loop could sanely be called "very hash = (hash * CONSTANT) ^ *p++; in the inner loop then you were looking at a micro-optimized version; Python's string hash loop body is of exactly that form. BTW, sizeof(char) is guaranteed to be 1 on all platforms, so if the link above is your function, you could collapse the "2*sizeof(char)" thingies to plain "2" without losing anything (yes, a good compiler will do that for you). I clearly can't agree that FNV is a complex approach. Python's needs to produce a C long, regardless of what "long" means on the target platform. On current platforms, that's always 32 or 64 bits, although the current implementation of string hashing doesn't know which, and doesn't care.
|
||||||||