PLEX86  x86- Virtual Machine (VM) Program
 CVS  |  Mailing List  |  Download  |  Newsgroups

String hashing was: Thou shalt have no other gods before the ANSI C standard 1548


Your Ad Here

Your Ad Here

CBFalconer

This is not comp.lang.c. The dead ANSI committee dogma has no audience here. (Or well it does -- from some trolling cross posters, but I don't think your blind adherence to some non-existent CLC charter should spill over into the crypto universe.)

Thou shalt have no other gods before the ANSI C standard 1552
Oh, a useless game. I am jumping in, I cannot resist. -- "I" is a poor name for a variable. Especially if the program...

The fact is that *MANY* hash functions don't work correctly at *all* unless they can have some sort of guarantee about the exact number of bits they are operating on (not some minimum number). Many also do not function correctly unless they can rely on 2s complement and-or sign extending right shift. Hashing-crypto tends to need this sort of precise specification.

That means you get to see code like those hash functions, and you deal with it as is, because nobody is paying attention to obsolete issues like this. (Or you deal with Java.)

You should search for the last time I was discussing "strings". It probably had to do with something completely different (and by now I think you can guess as to what it might be.) First of all, strings are not the only things that can be or needs to be hashed, second of all UTF-8 which is a very popular encoding of Unicode into octets *ARE* strings that can be directly hashed, thirdly the functions I tested are totally insensitive to ASCII versus non-ASCII, and finally, as you may well know, in my universe strings are just not limited to the ASCII range.

Excuse me? You can't even put someone's full name into strings that short. What are you doing? Trying to solve boggles or low score on scrabble or something? Oh I get it. You are trying to write a compiler -- where hashing is supposedly your bottleneck. *sigh*.

String hashing was: Thou shalt have no other gods before the ANSI C standard 1549
First, int it supposed to be "fast" on a given architecture, so I can see why the performance interest would look there first. Wanting to use it, and not blow up...

How are you planning to deal with rehashing, or quick verifying? You see, you *could* call your slower hash function a second time. Or you could just rip off more bits from your primary hash calculation if it outputs enough bits in the first place.

Bob Jenkins, the FNV guys, Colin Plumb and I must be so deluded ...

I've seen your hashlib implementation. Its a closed-rehash thing that is insensitive to L1 cache usage, and can't be used in hard real-time applications because it rebuilds the hash as it needs while it fills. But, I'm sure, in those cases where you are hash function limited, SFH will work just fine.

About time you actually testing various hash functions for it don't you think?

If you were doing a first cut at implementing google or m-w.com or something there might be some logic to this, but still the right answer is to use length delimited strings and completely *unroll* the hash function for each possible word size. I don't think the 31-33 hash would fair too well in such a situation.

But did you know that you can encode the *entire* english language in a "trie" data structure that fits in the L2 cache of most modern CPUs? Kind of makes the whole idea of "hashing words" seem a little silly, especially if you are concerned with performance. Why not just direct map straight from a trie?

In any event if you just want to test hash function collisions, be sure to include mixed case alphanumerics. I think the red lights on the 31-33 hash function will be pretty hard to ignore. SFH is more than ready for just about any real world data.

That's more like it -- I hear that people's full names, street addresses, and credit card numbers are an interesting application for string processing too. Just don't call strlen() repeatedly.

Typically you just read the file (or part of it) once, with consideration for how large it is relative to your memory or cache (depending on what you are testing) then just run the test on the same data over and over again. Some people like to do median of three for test results -- I prefer, *best* of three (its more stable, believe it or not.) Its not rocket science.

Good luck with it.

-- Paul Hsieh



Your Ad Here

List | Previous | Next

String hashing was: Thou shalt have no other gods before the ANSI C standard 1549

Alt Folklore Computers from Newsgroups

The #1 Usenet Provider on the Internet

String hashing was: Thou shalt have no other gods before the ANSI C standard