Alex I
Alex I

Reputation: 20287

How to calculate the hash of a string literal using only the C preprocessor?

I need to get a string checksum or hash (or something equivalent) using just the C preprocessor, if possible.

The use case is as follows: I'm doing error logging on an embedded device with very limited memory and cpu. I would like to define a LogError() macro which inserts hash(__FILE__) and __LINE__ in a circular buffer (as 16bit numbers). But hash(__FILE__) needs to be compiled to a constant; if the actual filenames are stored as strings in the program that would use too much memory. The hash can be calculated using any method.

It is possible to #define FILE_ID with some unique number at the top of every file, and use that when logging, but that is not the preferred solution, it has a bit of a maintenance cost. Is there a better method?

Upvotes: 7

Views: 3557

Answers (3)

Robert Gałat
Robert Gałat

Reputation: 129

How to do this in CMake ? I can not figure out how to run script for every source file, and the result place as a value for -DFILE_CHEKSUM

Method with STRHASH, has a big limitation - it takes maximum of 32 characters as input, and when I have longer names, it just does not compile. I think that computing it before build and passing it as -D is beter approach, but I just can not manage to do it in cmake :(

Upvotes: 0

ideasman42
ideasman42

Reputation: 48058

The question "How to calculate the hash of a string literal using only the C preprocessor?" is valid, however I think you're adding a red-herring by including details about __FILE__ and logging ID's.

This means anyone answering needs to solve the problem you describe, or answer the question on hashing a string with the pre-processor (which may not be a good solution in your particular case!).

As it happens, __FILE__ expands to variable, not a literal string (GCC at least), so you will need to define the filename as a constant. You could use the build system to pass along a define for each for example.

As others have pointed out you can calculate the hash and pass this in via the build-system, although this avoids the question about hashing a sting literal.


Whatever the case, this question comes up when I searched for using the pre-processor for hashing, and none of the answers cover this, so heres is an answer that covers the string hashing part.

This is possible, albeit quite verbose

/**
 * Implement compile-time string hashing on string literals.
 *
 * This macro implements the widely used "djb" hash apparently posted
 * by Daniel Bernstein to comp.lang.c some time ago.  The 32 bit
 * unsigned hash value starts at 5381 and for each byte 'c' in the
 * string, is updated: ``hash = hash * 33 + c``.  This
 * function uses the signed value of each byte.
 *
 * note: this is the same hash method that glib 2.34.0 uses.
 */

#define SEED 5381

#if 0
// correct but causes insane expansion
#  define _SH(e, c) (((e) << 5) + (e) + (unsigned char)(c))
#elif defined(__GNUC__)
// Use statement-expression extension
#  define _SH(e, c) ({ unsigned int _e = (unsigned int)(e); (_e << 5) + _e + (unsigned char)c; })
#else
// use an inline function, the compiler will be able to optimize this out.
static inline unsigned int _SH(unsigned int e, unsigned char c)
{
    unsigned int _e = (unsigned int)e;
    return (_e << 5) + _e + (unsigned char)c;
}
#endif

#define _SH_1(a) _SH(SEED, (a)[0])
#define _SH_2(a) _SH(_SH_1(a), (a)[1])
#define _SH_3(a) _SH(_SH_2(a), (a)[2])
#define _SH_4(a) _SH(_SH_3(a), (a)[3])
#define _SH_5(a) _SH(_SH_4(a), (a)[4])
#define _SH_6(a) _SH(_SH_5(a), (a)[5])
#define _SH_7(a) _SH(_SH_6(a), (a)[6])
#define _SH_8(a) _SH(_SH_7(a), (a)[7])
#define _SH_9(a) _SH(_SH_8(a), (a)[8])
#define _SH_10(a) _SH(_SH_9(a), (a)[9])
#define _SH_11(a) _SH(_SH_10(a), (a)[10])
#define _SH_12(a) _SH(_SH_11(a), (a)[11])
#define _SH_13(a) _SH(_SH_12(a), (a)[12])
#define _SH_14(a) _SH(_SH_13(a), (a)[13])
#define _SH_15(a) _SH(_SH_14(a), (a)[14])
#define _SH_16(a) _SH(_SH_15(a), (a)[15])
#define _SH_17(a) _SH(_SH_16(a), (a)[16])
#define _SH_18(a) _SH(_SH_17(a), (a)[17])
#define _SH_19(a) _SH(_SH_18(a), (a)[18])
#define _SH_20(a) _SH(_SH_19(a), (a)[19])
#define _SH_21(a) _SH(_SH_20(a), (a)[20])
#define _SH_22(a) _SH(_SH_21(a), (a)[21])
#define _SH_23(a) _SH(_SH_22(a), (a)[22])
#define _SH_24(a) _SH(_SH_23(a), (a)[23])
#define _SH_25(a) _SH(_SH_24(a), (a)[24])
#define _SH_26(a) _SH(_SH_25(a), (a)[25])
#define _SH_27(a) _SH(_SH_26(a), (a)[26])
#define _SH_28(a) _SH(_SH_27(a), (a)[27])
#define _SH_29(a) _SH(_SH_28(a), (a)[28])
#define _SH_30(a) _SH(_SH_29(a), (a)[29])
#define _SH_31(a) _SH(_SH_30(a), (a)[30])
#define _SH_32(a) _SH(_SH_31(a), (a)[31])

// initial check prevents too-large strings from compiling
#define STRHASH(a) ( \
    (void)(sizeof(int[(sizeof(a) > 33 ? -1 : 1)])), \
    (sizeof(a) == 1) ? SEED : \
    (sizeof(a) == 2) ? _SH_1(a) : \
    (sizeof(a) == 3) ? _SH_2(a) : \
    (sizeof(a) == 4) ? _SH_3(a) : \
    (sizeof(a) == 4) ? _SH_3(a) : \
    (sizeof(a) == 5) ? _SH_4(a) : \
    (sizeof(a) == 6) ? _SH_5(a) : \
    (sizeof(a) == 7) ? _SH_6(a) : \
    (sizeof(a) == 8) ? _SH_7(a) : \
    (sizeof(a) == 9) ? _SH_8(a) : \
    (sizeof(a) == 10) ? _SH_9(a) : \
    (sizeof(a) == 11) ? _SH_10(a) : \
    (sizeof(a) == 12) ? _SH_11(a) : \
    (sizeof(a) == 13) ? _SH_12(a) : \
    (sizeof(a) == 14) ? _SH_13(a) : \
    (sizeof(a) == 15) ? _SH_14(a) : \
    (sizeof(a) == 16) ? _SH_15(a) : \
    (sizeof(a) == 17) ? _SH_16(a) : \
    (sizeof(a) == 18) ? _SH_17(a) : \
    (sizeof(a) == 19) ? _SH_18(a) : \
    (sizeof(a) == 20) ? _SH_19(a) : \
    (sizeof(a) == 21) ? _SH_20(a) : \
    (sizeof(a) == 22) ? _SH_21(a) : \
    (sizeof(a) == 23) ? _SH_22(a) : \
    (sizeof(a) == 24) ? _SH_23(a) : \
    (sizeof(a) == 25) ? _SH_24(a) : \
    (sizeof(a) == 26) ? _SH_25(a) : \
    (sizeof(a) == 27) ? _SH_26(a) : \
    (sizeof(a) == 28) ? _SH_27(a) : \
    (sizeof(a) == 29) ? _SH_28(a) : \
    (sizeof(a) == 30) ? _SH_29(a) : \
    (sizeof(a) == 31) ? _SH_30(a) : \
    (sizeof(a) == 32) ? _SH_31(a) : \
    (sizeof(a) == 33) ? _SH_32(a) : \
    0)
// last zero is unreachable

// only for comparison
unsigned int strhash_func(const void *str)
{
    const signed char *p;
    unsigned int h = 5381;

    for (p = str; *p != '\0'; p++) {
        h = (h << 5) + h + (unsigned int)*p;
    }

    return h;
}

/* -------------------------------------------------------------------- */
#include <stdio.h>

#define TEST_STR1 "Hello World"
#define TEST_STR2 "Testing 123"
int main(void)
{
    unsigned int A = STRHASH(TEST_STR1);
    unsigned int B = STRHASH(TEST_STR2);

    printf("String hash: const %u <- '%s'\n", STRHASH(TEST_STR1), TEST_STR1);
    printf("String hash: const %u <- '%s'\n", STRHASH(TEST_STR2), TEST_STR2);
    printf("String hash: dyn %u <- '%s'\n", strhash_func(TEST_STR1), TEST_STR1);
    printf("String hash: dyn %u <- '%s'\n", strhash_func(TEST_STR2), TEST_STR2);

#if defined(__GNUC__)
    printf("Is this known at compile time?, answer is: %d\n", __builtin_constant_p(A));
#endif
}

Note, for some reason Clang 5.0 prints answer is: 0, however on closer inspection is does in fact know the value at compile time, __builtin_constant_p just doesn't seem to work as GCC does.

Upvotes: 7

Mike Kinghan
Mike Kinghan

Reputation: 61337

You're asking too much of the preprocessor.

Better just build each file with a 16bit checksum of its name defined as preprocessor macro, as @n.m suggests.

One trifling difficulty for this solution is laying your hands on a 16bit checksum utility. The GNU cksum tool is just 32-bit.

But FreeBSD has a better one that lets you choose 16 or 32 bits. If your development system is a Debian derivative then you can get it by:

sudo apt-get install freebsd-buildutils

Then run:

dpkg-query -L freebsd-buildutils

to see where the FreeBSD cksum has been installed. In my case, it is:

/usr/bin/freebsd-cksum

but you might find differently.

You direct FreeBSD cksum to produce a 16bit checksum by passing the option -o 1. You can check its man page.

Take care when generating the preprocessor macro that defines a filename checksum that you define the checksum as a 16 bit unsigned int, as that's what you want. Should it come out as a plain decimal numeral, for instance, it would default to signed int, which could cause you surprises.

Here's a sketch of the solution with GNU make:

main.c

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    printf("%hu\n",FILE_CHKSUM);
    return 0;
}

Makefile

.phony: all clean

SRCS = main.c
OBJS = $(SRCS:.c=.o)
# Take the 1st word of the output of `echo <filename> | freebsd-cksum -o 1`
FILE_CHKSUM = $(word 1,$(shell echo $(1) | freebsd-cksum -o 1))

all: prog

%.o:%.c
    $(CC) $(CPPLAGS) $(CFLAGS) -DFILE_CHKSUM='((uint16_t)$(call FILE_CHKSUM,$<))' -c -o $@ $< 

prog: $(OBJS)
    $(CC) $(CFLAGS) $(LDFLAGS) -o $@ $(OBJS) $(LDLIBS)

clean:
    rm -f $(OBJS) prog

Running make:

cc   -DFILE_CHKSUM='((uint16_t)3168)' -c -o main.o main.c 
cc   -o prog main.o

Run prog:

./prog
3168

Upvotes: 0

Related Questions