C++ code for 64-bit CRC (July 2025), public domain, use as you wish:
On Linux with g++, compile it with: g++ -mpclmul -msse2 crc.cpp crctest.cpp -o crctest.exe
g_crc->Compute("hello world!", 12, 0) == 0xd9160d1fa8e418e3.
An effort was made to make Compute() fast for short keys as well as long ones, and to make Concat() fast. Making short keys fast is the standard slog: use switch statements, with fall-throughs where that makes sense, to avoid if-statements and to let each length be optimized separately, and unroll loops. Loading all that complicated code is slow when CRC is called just once on a short key, but in that case you don't care about its speed anyway, so that's OK.
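The switch-with-fall-through idea can be sketched like this. This is a hypothetical illustration, not the code from crc.cpp: it uses a plain table-driven MSB-first CRC-64/ECMA step, where the real code does more unrolling and uses wider steps. The switch jumps straight to the right number of per-byte steps, with no length test inside the loop:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// CRC-64/ECMA polynomial, MSB-first; an assumption for this sketch.
static const uint64_t kPoly = 0x42F0E1EBA9EA3693ULL;

static uint64_t tab[256];

// Build the one-byte-at-a-time lookup table.
static void InitTab() {
  for (int b = 0; b < 256; ++b) {
    uint64_t r = (uint64_t)b << 56;
    for (int i = 0; i < 8; ++i)
      r = (r & (1ULL << 63)) ? (r << 1) ^ kPoly : (r << 1);
    tab[b] = r;
  }
}

// One table-driven step: consume one byte of input.
static inline uint64_t Step(uint64_t crc, uint8_t byte) {
  return tab[(crc >> 56) ^ byte] ^ (crc << 8);
}

// Handle a tail of 0..7 bytes with a switch and fall-throughs:
// no per-byte length check, and each case is optimizable on its own.
static uint64_t Tail(uint64_t crc, const uint8_t* p, size_t len) {
  switch (len) {
    case 7: crc = Step(crc, *p++); [[fallthrough]];
    case 6: crc = Step(crc, *p++); [[fallthrough]];
    case 5: crc = Step(crc, *p++); [[fallthrough]];
    case 4: crc = Step(crc, *p++); [[fallthrough]];
    case 3: crc = Step(crc, *p++); [[fallthrough]];
    case 2: crc = Step(crc, *p++); [[fallthrough]];
    case 1: crc = Step(crc, *p++); [[fallthrough]];
    case 0: break;
  }
  return crc;
}
```

The same shape handles the leftover bytes after a wide (pclmul or 8-bytes-at-a-time) main loop.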
Making Concat(A,B) fast relies on Compute() being fast, on using a table to handle 8 bits of exponent at a time, and on skipping the adjustment of A's final CRC when both A and A+B use the same starting CRC. I'm posting this so this implementation is public, in particular the trick of doing the exponentiation in Concat() eight bits at a time.
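A sketch of the Concat() math, under assumptions that are mine, not the author's: an MSB-first CRC-64/ECMA with zero starting CRC and no final xor (so it is not bit-compatible with the code above), and a bit-by-bit carryless multiply where the real code would use pclmul. With those conventions CRC is linear, so crc(A||B) = crc(A) * x^(8*lenB) mod P, xor crc(B), and the exponentiation is done eight bits of exponent at a time: one 256-entry table per byte of lenB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static const uint64_t kPoly = 0x42F0E1EBA9EA3693ULL;  // CRC-64/ECMA, MSB-first

// (a(x) * b(x)) mod P over GF(2), bit by bit; pclmul would do this fast.
static uint64_t MulMod(uint64_t a, uint64_t b) {
  uint64_t r = 0;
  for (int i = 63; i >= 0; --i) {
    uint64_t hi = r >> 63;
    r <<= 1;
    if (hi) r ^= kPoly;
    if ((b >> i) & 1) r ^= a;
  }
  return r;
}

// pow8[j][v] = x^(8 * v * 2^(8j)) mod P, so x^(8*len) is the product of
// one table lookup per byte of len: 8 bits of exponent at a time.
static uint64_t pow8[8][256];

static void InitPow() {
  uint64_t base = 0x100;  // x^8
  for (int j = 0; j < 8; ++j) {
    pow8[j][0] = 1;  // x^0
    for (int v = 1; v < 256; ++v)
      pow8[j][v] = MulMod(pow8[j][v - 1], base);
    base = MulMod(pow8[j][255], base);  // base^256, for the next byte of len
  }
}

// Reference CRC: crc(M) = M(x) * x^64 mod P, starting CRC passed in.
static uint64_t Crc(const uint8_t* p, size_t n, uint64_t crc) {
  for (size_t i = 0; i < n; ++i) {
    crc ^= (uint64_t)p[i] << 56;
    for (int k = 0; k < 8; ++k)
      crc = (crc & (1ULL << 63)) ? (crc << 1) ^ kPoly : (crc << 1);
  }
  return crc;
}

// crc(A||B) from crc(A), crc(B), and B's length in bytes, both pieces
// computed with starting CRC 0.
static uint64_t Concat(uint64_t crcA, uint64_t crcB, uint64_t lenB) {
  uint64_t shift = 1;  // will become x^(8*lenB) mod P
  for (int j = 0; j < 8; ++j)
    shift = MulMod(shift, pow8[j][(lenB >> (8 * j)) & 0xff]);
  return MulMod(crcA, shift) ^ crcB;
}
```

A bit-at-a-time exponentiation would do 64 multiplies; the byte tables cut that to at most 8 lookups and 8 multiplies, which is the trick being published here.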
Is this fast? Faster than the other CRC implementations you have available? You'll have to time it. I'm sure faster implementations will show up eventually. At the time I wrote it, on the platforms I ran on, it was the best I could do.
The reason to use CRC to check the integrity of a persistent store is not that it is great, but that it is good enough for detecting non-adversarial corruption, and it is the default. Nowadays any fast noncryptographic hash should be bound by memory bandwidth rather than compute, so CRC should be neither faster nor slower than anything else. Any other hash you choose will eventually seem like an antique choice, but CRC will always be the (even more antique) default. Also, due to the algebraic simplicity of CRC, Compute() will always be very optimizable, no matter what future architectures show up. And Concat() gives some flexibility that other hashes don't.
When you pass CRC-protected buffers from here to there, the way to do it is: first know the CRC of the data, then write the buffer, then check that the CRC of the written buffer matches. You pass both the buffer and the CRC to the next guy, and they do the same. You do not write the buffer, then compute the CRC, then send both, because if your write is corrupt, that CRC will faithfully include your corruption.
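The hand-off discipline above can be sketched as follows. The names (Send, Receive, Message) and the stand-in crc64() are mine, for illustration only; any CRC works here. The point is that the CRC is known before the buffer is written, so a corrupt write cannot contaminate it:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in bitwise CRC-64/ECMA, MSB-first, zero start; an assumption.
static uint64_t crc64(const uint8_t* p, size_t n) {
  const uint64_t kPoly = 0x42F0E1EBA9EA3693ULL;
  uint64_t crc = 0;
  for (size_t i = 0; i < n; ++i) {
    crc ^= (uint64_t)p[i] << 56;
    for (int k = 0; k < 8; ++k)
      crc = (crc & (1ULL << 63)) ? (crc << 1) ^ kPoly : (crc << 1);
  }
  return crc;
}

struct Message {
  std::vector<uint8_t> buf;
  uint64_t crc;
};

// Sender: the CRC was known before the buffer was written, so we check
// the written copy against it, never the other way around.
bool Send(const uint8_t* data, size_t n, uint64_t known_crc, Message* out) {
  out->buf.assign(data, data + n);  // the "write"
  out->crc = known_crc;
  return crc64(out->buf.data(), out->buf.size()) == known_crc;
}

// Receiver: same check before trusting or forwarding the buffer.
bool Receive(const Message& m) {
  return crc64(m.buf.data(), m.buf.size()) == m.crc;
}
```

Each hop re-checks the buffer it received against the CRC it was handed, and passes both along unchanged.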
If you have many records, you need to note the record boundaries, but storing a CRC per record may be overkill; records can be small, very small, like a byte. Have some format that stores a CRC no more than once per kilobyte or so, and if you have to send off smaller pieces you can compute the CRCs for them when you need to. That way these 8-byte CRCs do not add much space overhead. A fast Concat() makes it convenient to split and glue pieces together at arbitrary boundaries without changing the CRC of larger chapters that include them.