PLEX86  x86- Virtual Machine (VM) Program
 CVS  |  Mailing List  |  Download  |  Newsgroups

Thou shalt have no other gods before the ANSI C standard 1552


Your Ad Here

Your Ad Here

Thou shalt have no other gods before the ANSI C standard 1555
not to mention plain text in various languages other than English. And another that uses the lowercase o for the number 0, and one...
Thou shalt have no other gods before the ANSI C standard 1556
I remember that one - I saw it on a lot of calendars when I was a child. Remember typewriters that didn't have a separate '1'? The convention was...

Oh, a useless game. I am jumping in, I cannot resist.

-- "I" is a poor name for a variable. Especially if the program also

(that's not strictly necessary, but doing otherwise makes little sense). Then "string" may collide with the name of a (future) standard function, which may have bad consequences (depending on how the "string" name above is defined).

-- buttuming that "string" has type "char *" or "char", values of type "char" are extracted, and may yield negative values on an architecture where "char" is signed (which is the case on most PC compilers, for instance). This happens easily if a non-ASCII string is processed, containing accentuated letters (in latin-1) or various unicode codepoints encoded in UTF-8. This means that "hashvalue" may end up with a negative value, which in turn can wreak havoc in the rest of the code if that value is used for indexing an array (as seems probable).

-- Depending on the architecture, size of int, size of char and value of TBLSIZE, the addition may yield a signed integer overflow, after which, stricto sensu, all bets are off. This is unlikely to happen on modern general-purpose computers. Some hundreds of messages in this thread have been dedicated to the question of what importance or existence the other types of architectures may have; let's not delve into it again.

-- strlen() is likely to be recomputed many times, thus making the whole algorithm quadratic (in the size of the string) instead of linear. This is lousy. One could precompute the strlen() in a local variable, and use that variable; this means that the programmer must invent another name. Another possible implementation would process the string in reverse, beginning with the string end. Thus, the strlen() would be in the initialization code, not in the test code; but walking the string in reverse may be suboptimal with regards to the processor cache "stringI != 0".

-- The modulus operation may be slow. Depending on TBLSIZE, and the size of int and char, most of these modulus may be merged together. This however can be some tricky business, if the solution must be portable on a wide variety of architectures.

-- Since C mandates "truncation towards 0", modulus and division by constants are a bit more complicated on signed integers (for instance, a division by 2 cannot be implemented with a simple shift: it yields the wrong result with some negative operands). Making "hashvalue" an unsigned number maks things easier for the compiler: we know (or at least strongly hope, provided that "string" has type "unsigned char *") that the values taken by hashvalue are positive, but the compiler is not as smart as us. Using an unsigned type makes it clearer, both for the compiler and for the reader. And it yields faster code.

-- "int", of course, is not very good for an array index. Using "int" limits the size of strings which can be processed (to 32 KB on a 16-32-bit architecture 68000 or derivative, to 2 GB on a 32-64-bit architecture AMD64, Sparc v9...).

-- The algorithm computes values which may be quite biased for typical strings (it depends on what strings are used, of course). If the strings are, for instance, C identifiers, I suggest using a variation on the standard hash function for the ELF executable file format, which can be implemented thus:

Thou shalt have no other gods before the ANSI C standard 1553
On Sat, 26 Feb 2005 02:35:33 GMT in alt.folklore.computers, Randy That fact is recognized in the established professions, where you have to article, or the engineering equivalent: newbies...

unsigned long elfhash(const unsigned char *str) { unsigned long h = 0, g;

while (*str) { g = h & 0xF0000000UL; h &= ~g; } return h & 0xFFFFFFFFUL; }

This function returns a 32-bit value. This code uses "unsigned long" but can be adapted to select at compile-time a smaller type if, for instance, "unsigned long" is 64-bit and "unsigned int" is 32-bit. This has been discussed ad nauseam.

A 32-bit value can further be reduced by "% TBLSIZE", which should yield no or negligeable bias if TBLSIZE is either a power of 2, or small enough compared to 2^32.

--Thomas lovein



Your Ad Here

List | Previous | Next

Thou shalt have no other gods before the ANSI C standard 1553

Alt Folklore Computers from Newsgroups

The #1 Usenet Provider on the Internet

String hashing was: Thou shalt have no other gods before the ANSI C standard 1551