C++ code for 64-bit CRC (July 2025), public domain, use as you wish:
On Linux with g++, compile it with: g++ -mpclmul -msse2 crc.cpp crctest.cpp -o crctest.exe
g_crc->Compute("hello world!", 12, 0) == 0xd9160d1fa8e418e3.
An effort was made to make Compute() fast for short keys as well as long ones, and to make Concat() fast. Making short keys fast is the standard slog: use switch statements, with fall-throughs where that makes sense, to avoid if-statements and to let each length be optimized separately, and unroll loops. Loading all that complicated code is slow when CRC is called just once on a short key, but in that case you don't care about its speed anyway, so that's OK.
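The switch-with-fall-through idea can be sketched like this. This is a hypothetical illustration, not the code from crc.cpp: it uses a plain table-driven MSB-first CRC-64/ECMA step, where the real code does more unrolling and uses wider steps. The switch jumps straight to the right number of per-byte steps, with no length test inside the loop:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// CRC-64/ECMA polynomial, MSB-first; an assumption for this sketch.
static const uint64_t kPoly = 0x42F0E1EBA9EA3693ULL;

static uint64_t tab[256];

// Build the one-byte-at-a-time lookup table.
static void InitTab() {
  for (int b = 0; b < 256; ++b) {
    uint64_t r = (uint64_t)b << 56;
    for (int i = 0; i < 8; ++i)
      r = (r & (1ULL << 63)) ? (r << 1) ^ kPoly : (r << 1);
    tab[b] = r;
  }
}

// One table-driven step: consume one byte of input.
static inline uint64_t Step(uint64_t crc, uint8_t byte) {
  return tab[(crc >> 56) ^ byte] ^ (crc << 8);
}

// Handle a tail of 0..7 bytes with a switch and fall-throughs:
// no per-byte length check, and each case is optimizable on its own.
static uint64_t Tail(uint64_t crc, const uint8_t* p, size_t len) {
  switch (len) {
    case 7: crc = Step(crc, *p++); [[fallthrough]];
    case 6: crc = Step(crc, *p++); [[fallthrough]];
    case 5: crc = Step(crc, *p++); [[fallthrough]];
    case 4: crc = Step(crc, *p++); [[fallthrough]];
    case 3: crc = Step(crc, *p++); [[fallthrough]];
    case 2: crc = Step(crc, *p++); [[fallthrough]];
    case 1: crc = Step(crc, *p++); [[fallthrough]];
    case 0: break;
  }
  return crc;
}
```

The same shape handles the leftover bytes after a wide (pclmul or 8-bytes-at-a-time) main loop.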
Making Concat(A,B) fast relies on Compute() being fast, on using a table to handle 8 bits of exponent at a time, and on skipping the adjustment of A's final CRC when both A and A+B use the same starting CRC. I'm posting this so this implementation is public, in particular the trick of doing the exponentiation in Concat() eight bits at a time.
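A sketch of the Concat() math, under assumptions that are mine, not the author's: an MSB-first CRC-64/ECMA with zero starting CRC and no final xor (so it is not bit-compatible with the code above), and a bit-by-bit carryless multiply where the real code would use pclmul. With those conventions CRC is linear, so crc(A||B) = crc(A) * x^(8*lenB) mod P, xor crc(B), and the exponentiation is done eight bits of exponent at a time: one 256-entry table per byte of lenB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static const uint64_t kPoly = 0x42F0E1EBA9EA3693ULL;  // CRC-64/ECMA, MSB-first

// (a(x) * b(x)) mod P over GF(2), bit by bit; pclmul would do this fast.
static uint64_t MulMod(uint64_t a, uint64_t b) {
  uint64_t r = 0;
  for (int i = 63; i >= 0; --i) {
    uint64_t hi = r >> 63;
    r <<= 1;
    if (hi) r ^= kPoly;
    if ((b >> i) & 1) r ^= a;
  }
  return r;
}

// pow8[j][v] = x^(8 * v * 2^(8j)) mod P, so x^(8*len) is the product of
// one table lookup per byte of len: 8 bits of exponent at a time.
static uint64_t pow8[8][256];

static void InitPow() {
  uint64_t base = 0x100;  // x^8
  for (int j = 0; j < 8; ++j) {
    pow8[j][0] = 1;  // x^0
    for (int v = 1; v < 256; ++v)
      pow8[j][v] = MulMod(pow8[j][v - 1], base);
    base = MulMod(pow8[j][255], base);  // base^256, for the next byte of len
  }
}

// Reference CRC: crc(M) = M(x) * x^64 mod P, starting CRC passed in.
static uint64_t Crc(const uint8_t* p, size_t n, uint64_t crc) {
  for (size_t i = 0; i < n; ++i) {
    crc ^= (uint64_t)p[i] << 56;
    for (int k = 0; k < 8; ++k)
      crc = (crc & (1ULL << 63)) ? (crc << 1) ^ kPoly : (crc << 1);
  }
  return crc;
}

// crc(A||B) from crc(A), crc(B), and B's length in bytes, both pieces
// computed with starting CRC 0.
static uint64_t Concat(uint64_t crcA, uint64_t crcB, uint64_t lenB) {
  uint64_t shift = 1;  // will become x^(8*lenB) mod P
  for (int j = 0; j < 8; ++j)
    shift = MulMod(shift, pow8[j][(lenB >> (8 * j)) & 0xff]);
  return MulMod(crcA, shift) ^ crcB;
}
```

A bit-at-a-time exponentiation would do 64 multiplies; the byte tables cut that to at most 8 lookups and 8 multiplies, which is the trick being published here.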
Is this fast? Faster than the other CRC implementations you have available? You'll have to time it. I'm sure faster implementations will show up eventually. At the time I wrote it, on the platforms I ran on, it was the best I could do.
The reason to use CRC to check the integrity of a persistent store is not that it is great, but that it is good enough for detecting non-adversarial corruption, and it is the default. Nowadays any fast noncryptographic hash should be bound by memory bandwidth rather than compute, so CRC should be neither faster nor slower than anything else. Any other hash you choose will eventually seem like an antique choice, but CRC will always be the (even more antique) default. Also, due to the algebraic simplicity of CRC, Compute() will always be very optimizable, no matter what future architectures show up. And Concat() gives some flexibility that other hashes don't.
When you pass CRC-protected buffers from here to there, the way to do it is: first know the CRC of the data, then write the buffer, then check that the CRC of the written buffer matches. You pass both the buffer and the CRC to the next guy, and they do the same. You do not write the buffer, then compute the CRC, then send both, because if your write is corrupt, that CRC will faithfully include your corruption.
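The hand-off discipline above can be sketched as follows. The names (Send, Receive, Message) and the stand-in crc64() are mine, for illustration only; any CRC works here. The point is that the CRC is known before the buffer is written, so a corrupt write cannot contaminate it:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in bitwise CRC-64/ECMA, MSB-first, zero start; an assumption.
static uint64_t crc64(const uint8_t* p, size_t n) {
  const uint64_t kPoly = 0x42F0E1EBA9EA3693ULL;
  uint64_t crc = 0;
  for (size_t i = 0; i < n; ++i) {
    crc ^= (uint64_t)p[i] << 56;
    for (int k = 0; k < 8; ++k)
      crc = (crc & (1ULL << 63)) ? (crc << 1) ^ kPoly : (crc << 1);
  }
  return crc;
}

struct Message {
  std::vector<uint8_t> buf;
  uint64_t crc;
};

// Sender: the CRC was known before the buffer was written, so we check
// the written copy against it, never the other way around.
bool Send(const uint8_t* data, size_t n, uint64_t known_crc, Message* out) {
  out->buf.assign(data, data + n);  // the "write"
  out->crc = known_crc;
  return crc64(out->buf.data(), out->buf.size()) == known_crc;
}

// Receiver: same check before trusting or forwarding the buffer.
bool Receive(const Message& m) {
  return crc64(m.buf.data(), m.buf.size()) == m.crc;
}
```

Each hop re-checks the buffer it received against the CRC it was handed, and passes both along unchanged.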
If you have many records, you need to note the record boundaries, but storing a CRC per record may be overkill; records can be small, very small, like a byte. Have some format that stores a CRC no more than once per kilobyte or so, and if you have to send off smaller pieces you can compute the CRCs for them when you need to. That way these 8-byte CRCs do not add much space overhead. A fast Concat() makes it convenient to split and glue pieces together at arbitrary boundaries without changing the CRC of larger chapters that include them.