Work in progress: Maraca, a submission for SHA-3

Maraca has been submitted to NIST, rejected by NIST, and later utterly broken by Sebastiaan Indesteege. A message hashing to an arbitrarily chosen hash value can be derived in about two seconds of computing time.

NIST is having a competition to define a new cryptographic hash to replace SHA-2 (they'll dub it SHA-3). I think I'll enter, if I can come up with something worthwhile. This page is a work-in-progress of the hash I'm currently considering. I'm weak on hardware and hash cryptanalysis, so I expect I'll be revising these hashes considerably as I learn more.

(Feb 10 2008) My first attempt, nandhash.c, isn't crypto at all, it's just attempting to do an efficient hardware hash. It hashed 128 bits arranged in eight 4x4 blocks, with alternating 4x4 bit mixes and XORs between bits in neighboring blocks. The 4x4 bit mix consisted of an LFSR of 8 of the bits (3 or 4 XORs per bit), and (~s)^((w&x)|(y&z)) for the remaining 8 bits (for some w,x,y,z chosen from the preceding bits). In software, the 8 4x4 blocks can be mixed simultaneously by storing each of the 16 bits of each in 16 8-bit words, then manipulating the words.
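
The bit-slicing trick mentioned above is worth spelling out: one software operation computes the same gate for all 8 blocks at once. A minimal sketch, with hypothetical tap positions (bits 0..4 as s,w,x,y,z; nandhash.c chooses its own taps):

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t u1;  /* one bit of each of the 8 blocks, one block per bit position */

/* Bit-sliced nonlinear mix bit: slice[i] holds bit i of all eight 4x4
   blocks, so this one expression computes (~s)^((w&x)|(y&z)) for all
   8 blocks simultaneously. */
static u1 mix_bit(u1 s, u1 w, u1 x, u1 y, u1 z)
{
    return (u1)((~s) ^ ((w & x) | (y & z)));
}

/* scatter: bit i of block b becomes bit b of slice[i] */
static void to_slices(const uint16_t block[8], u1 slice[16])
{
    for (int i = 0; i < 16; i++) {
        u1 s = 0;
        for (int b = 0; b < 8; b++)
            s |= (u1)(((block[b] >> i) & 1) << b);
        slice[i] = s;
    }
}
```

The same layout scales up later in the document: with 64 blocks instead of 8, the slices become 64-bit words and the per-gate cost stays at one instruction.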

NIST wants at least a 512-bit hash value. NIST also wants the ability to parameterize the hash with a 512-bit secret key. That dictates at least a 512-bit state and 512-bit key. I'm guessing I'll do this with a 1024-bit state, with a 1024-bit mix, where I XOR 512 bits of message to one side and 512 bits of key to the other side, then mix, with some setup and wrapup procedure. I expect to hash 2^20-byte subsequences, then hash the 1024-bit states of the subsequences if messages are longer than that, so that the hash can make use of massive parallelism for large streams. Or perhaps I should only XOR 128 bits of message and key at a time, but 4x as often?

One attack on crypto hashes is to find x that produces pattern y in reverse and pattern z forward. Then if you start with pattern y, run it forward you'll get x, then run it forward more you'll get z. So if it takes m rounds backwards to achieve avalanche, and n rounds forward, then any less than m+n rounds is clearly not good enough for a crypto hash. The right number of rounds is probably double or triple that, but m+n is the largest number of rounds I can easily show is no good. It's a good metric for comparing mixing functions.

(Feb 11 2008) The next hash, nisthash1.c, was the same 4x4 mix, but there were 8x8 4x4 blocks (1024 bits in all). The 4x4 mix is now done in software with 64-bit words instead of 8-bit words, since there are now 64 4x4 blocks instead of 8 4x4 blocks. Horizontal and vertical mixing XORed all 8 corresponding horizontal and vertical bits, then XORed that again to the original bit. I tested one-bit diffs against random bases. It took 5 rounds forward and 5 rounds backward to achieve avalanche. The 4x4 mix is 6 nands deep, and the XOR is 9 nands deep, so (5+5)*(6+9)=150 nands deep minimum.

Nisthash1.c did awfully if the initial state was mostly zero or mostly one. It also had symmetry between all the 4x4 blocks, which was very bad: a symmetric initial state stayed symmetric forever.

(Feb 18 2008) Next came the imaginatively named nisthash2.c. This broke symmetry by complementing s in (~s)^((w&x)|(y&z)) only half the time. In software that was accomplished by XORing a constant to s. I chose "half the time" in a very regular way that guaranteed each 4x4 mix differed from every other one in at least 2 ways. Randomly chosen constants worked better than that very regular way. But NIST wants the design decisions to be fully explainable. I expect I'll eventually resort to randomly chosen constants, randomly generated by my small state prng with some seed. I've been using that a lot for hillclimbing. Also in this nisthash2.c, I allowed w,x,y,z to be optionally complemented. I tested this with almost-all-zero and almost-all-one states, as well as with random states. It took 6 rounds backwards and 6 rounds forwards for avalanche, so (6+6)*(6+9) = 180 nands deep minimum.

(Feb 19 2008) The next, nisthash3.c, was basically the same as nisthash2.c, but it had a slightly better mixing function. I noticed that although I allowed the top eight bits to nonlinearly mix any bit before them, the best functions usually used only the first eight bits, so in nisthash3.c I added that restriction: w,x,y,z had to come from the first eight bits. About the same time I noticed a bug in my hillclimber that prevented it from using the previous bit, which might also explain why the first eight bits were favored. Still (6+6)*(6+9) = 180 nands deep minimum.

(Feb 22 2008) Several people have told me that wire length doesn't matter much, it's nand depth I should pay attention to. I don't have a hardware simulator to tell me one way or the other. But if wire length is irrelevant, perhaps I should eliminate the XORs in horizontal and vertical mixing. nisthash4.c does that. I also realized that the s side of s^((w&x)|(y&z)) isn't many nands deep, so I could get more mixing in by doing (s^t)^((w&x)|(y&z)). So I do an LFSR with 1 or 2 taps for the top 8 bits, as well as an LFSR with 3 or 4 taps for the bottom 8 bits. To keep this reversible I could only use w,x,y,z from the bottom 8 bits. 6 rounds backwards or 8 rounds forward for avalanche, so (6+8)*(6) = 84 nands deep minimum. Unfortunately this got rid of my symmetry-breaking, and the hash is horribly symmetric again.

Is 84 nands deep good? What's the best possible? XOR is 3 nands deep, and you need a 10-level tree to combine 1024 bits, so XOR trees forward and backwards would be (10+10)*3 = 60 nands deep. At 3 nands deep, XOR could be combining 8 variables, but it's only combining 2. If you look at (s^t)^~(~(w&x)&~(y&z)), the XORs are giving you avalanche, and the rest is giving you nonlinearity. It's 6 nands deep. Is there some sort of permutation that could give you nonlinearity and avalanche simultaneously that's less than 6 nands deep? I don't know. I don't think anybody has checked.
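
The 3-nand-deep XOR is easy to sanity-check: x^y = NAND(NAND(x,t), NAND(y,t)) where t = NAND(x,y). A quick exhaustive check, nothing more:

```c
#include <assert.h>

/* XOR from four 2-input NANDs arranged 3 deep:
   t = NAND(x,y); x^y = NAND(NAND(x,t), NAND(y,t)). */
static int nand(int a, int b) { return !(a & b); }

static int xor3deep(int x, int y)
{
    int t = nand(x, y);       /* depth 1 */
    return nand(nand(x, t),   /* depth 2, then depth 3 */
                nand(y, t));
}
```

Stacking ten levels of these combines 1024 inputs at depth 10*3 = 30, which is where the (10+10)*3 = 60 figure comes from.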

I wrote a program, balancenand.c (March 7 2008), to come up with all balanced binary functions that are n nands deep or less. After it generates the truth tables, I run sort and uniq on them. There are 116 distinct truth-tables for balanced binary functions 3 nands deep or less (none uses more than 4 variables), and 2,094,862 distinct truth-tables for balanced binary functions 4 nands deep or less using 6 variables or less. None of them do avalanche quite as well as XOR, but some of the 5-variable functions come close. 4 nands deep, 7 variables or less, has 26,771,332 distinct truth tables.

(March 13 2008) Another program, nandperm.c, reads in a file of truth tables and randomly hillclimbs trying to build a permutation out of them. It picks one bit to replace, checks if a new balanced function can replace it, and if it can it also checks if the new balanced function can extend the permutation without replacing it. Once a whole permutation is made, it continues, replacing one bit function at a time, walking through many related permutations. I don't know if the graph of permutations is connected that way.

(March 17 2008) For example, here is its report for a 4-nand-deep permutation of 16 bits:

4   0, 1, 5,14   1111000100011100
4   2, 4, 6, 8   1010101001010110
4   1, 4, 6, 7   1111000000011110
4   2, 4, 6, 8   1111000100001110
5   0, 1, 5,12,14   10001001100011111001100110011001
5   5, 9,10,11,15   00000000111111110111011100001010
4   0, 1, 4, 5   1111000000011110
4   0, 1, 3, 6   1100001111010010
5   5, 9,10,11,15   00001111000011110101001011110010
5   5, 9,10,11,15   00110011001100110100110001101110
5   5, 9,12,13,15   11110000000010111111010100001011
4   0, 5,12,14   1010010110110010
5   0, 1, 5,12,14   00011100000111000001111011011110
4   0, 3, 4, 7   1010101101010100
5   5, 9,12,13,15   11110001111101010000101000001011
5   0, 1, 5,12,14   11111111000000000001010110101011
Here are a few more (March 20 2008).

(March 15 2008) It turns out 3-way XOR can be implemented in a tree only 5 NANDs deep. Does this generalize to N-way XOR being 2*ceil(log2(N))+1 NANDs deep, rather than 3*ceil(log2(N)) NANDs deep? If it does, does this require an exponential number of gates? I'm not sure yet. I wish I knew whether 2-way NANDs or 2-way NORs or 3-way NANDs or 4-way NORs or whatever were the standard component for circuits.

(March 16 2008) To measure the avalanche of a 0/1 function, for every input value, flip each input bit and add to a total if the result changes, then divide the total by the number of input values. The maximum avalanche for an n-bit 0/1 function is n, and it is achieved by the XOR of those n bits. Consider a function !(x&y), where x and y are themselves 0/1 functions. What can we say about x and y if !(x&y) is balanced and has avalanche > Z? Some zeros come from x, some from y, some from both x and y. If x has (.5-q) zeros, then y must contribute exactly q zeros that aren't shared by x (note q is a fraction of the total, not a raw count). The avalanche of !(x&y) is at most the sum of the avalanche of x and the avalanche of y. And this is achieved only if x contributes all its zeros and y also contributes all its zeros, which is only possible when the zeros of x and y sum to .5. Suppose x is .5-q zeros but y is r zeros, where r > q. It's reasonable to expect at most q/r of the avalanche from y to be contributed to !(x&y), total avalanche probably under (avalanche x) + (q/r)(avalanche y). The avalanche in y might be concentrated in a subset of the zeros, so we can do better, but if we're picking functions that have unusually high avalanche, this is unlikely. If the input bits that x and y depend on don't overlap, for !(x&y) to be balanced, we need (ones in x)*(ones in y) = .5, so both are sqrt(.5) on average. Similarly avalanche would be sqrt(.5)(avalanche in x + avalanche in y).
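
That avalanche measure is cheap to compute by brute force on a truth table. A sketch, scaled by 2^n so the arithmetic stays in integers (the table encoding here is my own, not balancenand.c's):

```c
#include <assert.h>
#include <stdint.h>

/* Avalanche of an n-input 0/1 function given as a truth table of 2^n
   entries: for every input value, flip each input bit and count how
   often the output changes.  Returns avalanche * 2^n; divide by 2^n
   to get the avalanche value described in the text. */
static unsigned avalanche_scaled(const uint8_t *table, int n)
{
    unsigned total = 0, size = 1u << n;
    for (unsigned v = 0; v < size; v++)
        for (int b = 0; b < n; b++)
            if (table[v] != table[v ^ (1u << b)])
                total++;
    return total;
}
```

XOR of n bits (parity) attains the maximum, n; a constant function scores zero.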

(March 22 2008) 2^n-way XOR can be implemented only 2n+1 NANDs deep. The 1 is to turn all inputs into themselves and their complements: (a,!a) is implemented as (a, !(a&a)). Given (a,A),(b,B), where a and b are each the XOR of n variables and A,B are their complements, ((a^b),!(a^b)) is implemented as (!(!(a&B)&!(A&b)), !(!(a&b)&!(A&B))). This means XOR is quite a bit more competitive than I thought it was at achieving avalanche. My surveys so far find nothing 4-deep or 5-deep that approaches the ability of XOR to achieve avalanche. An XOR of 4 variables is 5 deep and has an avalanche of 4. The best 5-deep non-XOR I've found so far uses 5 variables and has avalanche of 3.5, 10100011010111000101110010100011. It's clearly a variant on XOR. XOR increases avalanche 1.4142x per level. Random trees with the golden ratio of bits set (which causes !(a&b) to also have the golden ratio of bits set) increases avalanche by about 1.236x per level, well short of XOR. I'll search 5-deep balanced functions a bit more, but at the moment I believe that Feistel networks are about the optimal way to achieve avalanche out of all possible NAND trees.
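
The dual-rail step can be checked exhaustively. Here (a,A) and (b,B) are a value and its complement, and the two nested NANDs are the two levels added per doubling:

```c
#include <assert.h>

/* Dual-rail XOR step: given (a,A) and (b,B) with A = !a, B = !b,
   produce (a^b, !(a^b)) using only 2-input NANDs, two levels deep:
   a^b    = !(!(a&B) & !(A&b))
   !(a^b) = !(!(a&b) & !(A&B)) */
static int nand2(int x, int y) { return !(x & y); }

static void xor_pair(int a, int A, int b, int B, int *out, int *outc)
{
    *out  = nand2(nand2(a, B), nand2(A, b));
    *outc = nand2(nand2(a, b), nand2(A, B));
}
```

One NAND level to generate complements, then 2 levels per doubling, gives the 2n+1 depth for 2^n-way XOR.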

(March 27 2008) I realized that hillclimbing to find an initial 16-bit permutation (which takes days) can be skipped by instead initializing the hillclimber with two 8-bit permutations (which takes a few seconds). Here's a 5-deep 16-bit permutation where every balanced function has avalanche between 3.25 and 4.0. (There are no 6+ variable balanced functions that are 5-deep with avalanche greater than 3.125.) I'm not claiming it's a particularly good permutation, it's just a permutation.

1, 3, 4, 5, 6   10111000010001110100011110111000  (((1 (1 2)) ((0 0) (1 1))) ((3 4) ((3 3) (4 4))))  (((0 (0 1)) (1 2)) ((3 (3 4)) (4 (3 3))))
1, 3, 4, 5, 6   10001011011101000111010010001011  (((0 (0 1)) (1 (1 2))) ((3 (3 4)) (4 (3 3))))  (((1 2) ((0 0) (1 1))) ((3 4) ((3) (4 4))))
0, 2, 3, 7, 9   11100100000110110001101111100100  (((0 1) (2 (0 0))) ((3 (3 4)) (4 (3 3))))  (((0 (0 1)) ((0 0) (2 2))) ((3 4) ((3 3) (4 4))))
0, 3, 5, 7   1001011001101001  (((0 1) ((0 0) (1 1))) ((2 3) ((2 2) (3 3))))  (((0 (1 1)) ((0 0) 1)) ((2 (3 3)) ((2 2) 3)))
3, 4, 5, 6,13   10010110100101100110101101100001  (((0 (1 4)) (2 (3 4))) ((1 (0 2)) (4 (0 2))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
2, 5,10,12,13   10100101010110101001011010010110  (((0 2) ((0 0) (2 2))) ((4 (1 1)) ((1 4) (3 3))))  (((0 (0 2)) (2 (0 0))) ((1 4) (3 (3 4))))
0, 4, 5, 6   1001011001101001  (((0 1) ((0 0) (1 1))) ((2 3) ((2 2) (3 3))))  (((0 (1 1)) ((0 0) 1)) ((2 (3 3)) ((2 2) 3)))
0, 3, 4, 6   1001011001101001  (((0 1) ((0 0) (1 1))) ((2 3) ((2 2) (3 3))))  (((0 (1 1)) ((0 0) 1)) ((2 (3 3)) ((2 2) 3)))
1, 3, 5, 6, 7   10101001010101100110010110011010  (((0 3) ((0 0) (3 3))) ((1 (1 2)) ((1 1) (4 4))))  (((0 (0 3)) (3 (0 0))) ((1 2) (4 (1 1))))
2, 4, 8, 9,12   10100101010110101001011010010110  (((0 2) ((0 0) (2 2))) ((4 (1 1)) ((1 4) (3 3))))  (((0 (0 2)) (2 (0 0))) ((1 4) (3 (3 4))))
4, 6,11,12,15   10011001011001100110001011011001  (((0 (1 4)) (3 (1 4))) ((1 (0 3)) (4 (0 2))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
0, 4, 7,12,15   10010110110111100010000101101001  (((0 (0 2)) (2 (0 0))) ((1 (1 4)) (3 (1 1))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
0, 4,11,12,14   10010010011000011101011011101001  (((0 3) ((0 0) (3 3))) ((1 2) ((1 1) (2 2))))  (((0 (0 3)) (3 (0 0))) ((1 (1 2)) (4 (1 1))))
1, 4, 5, 6,13   00011010001001011011010101111010  (((0 (0 3)) (3 (0 0))) ((1 (1 2)) (2 4)))  (((0 2) ((0 0) (2 2))) ((1 (1 3)) (3 (1 1))))
8, 9,11,13,14   00111100001011011100001111010010  (((0 3) ((0 2) (1 1))) ((2 (2 4)) (4 (2 2))))  (((1 (0 3)) (2 (2 4))) ((2 4) ((1 2) (1 4))))
2, 4, 6, 7,10   10010110011011011001011001100001  (((0 (1 3)) (2 (1 3))) ((1 (0 2)) (3 (2 4))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
I suspect that this approach can't be competitive in software with table-based hashes, though it's likely faster in hardware. Table-based hashes take advantage of the multiple layers of gates that accomplish index lookup, while what I'm doing here is one software instruction per gate (OK, per 64 gates, since each instruction computes the same gate for 64 bit-sliced blocks). There might be some advantage in the table values being precomputed, too.

(April 9 2008) Here's a permutation that can claim that the difference between its mapping for 0 and its mapping for a value with 1 bit set will always have at least 2 bits different. That's something I couldn't accomplish with LFSRs and Feistel networks. NAND depth 5.

5   4, 9,10,12,14   10010110011110011000011001101001
5   3, 4, 5, 8,10   01111001100101110010100110010010
5   0, 1, 3, 6,13   10100101010110101001011010010110
5   1, 7,10,14,15   10010110100101000110101101101001
5   3, 7,10,11,15   10010110011110011001011000101001
5   2,10,12,14,15   10010101011010101010011001011001
5   0, 1, 6, 9,11   10011001111101100110011010010000
5   1, 2, 3, 6,13   00111100110000111001011010010110
5   0, 1, 3, 9,13   10011001011001100111001010110001
5   2, 3, 4, 7,10   10010110100100100110110101101001
5   0, 6, 9,11,13   01110110110110011001100001100010
5   5, 8,11,14,15   00111100110000111010001100111010
5   0, 6, 9,11,13   11010001011101000011110011000011
5   0, 6, 9,11,13   10010100011000011001011101101101
5   0, 2, 6, 9,11   00011100110100111100000100111101
5   0, 2, 3, 5, 8   10111000010001110100011110111000

(April 10 2008) To verify these 16-bit permutations are permutations, I'm checking all 2^16 values. Suppose I wanted to build a 512-bit permutation, it wouldn't be feasible to check all 2^512 values. But, suppose each output bit depends on at most 5 input bits. Does that help? I think so. If I want to change the function f(a,b,c,d,e) for the last bit to g(a,b,c,d,e), I just have to find all output bits that have a,b,c,d,e as their inputs and evaluate them. So I map a bit to its inputs, back to all their outputs, then back to their inputs, and I go through all values for all those input bits. I think if the result is balanced then it's still a permutation with g(a,b,c,d,e) replacing f(a,b,c,d,e). I think that shows a,b,c,d,e are still happy, and it can't break anything other than a,b,c,d,e.
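
The 2^16 check itself is cheap. A sketch, using a throwaway reversible map in place of a real NAND-tree permutation (the shift/XOR steps are my own stand-in, each individually invertible):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy 16-bit map built from invertible steps (xorshift and a rotate),
   standing in for a NAND-tree permutation. */
static uint16_t toy_map(uint16_t x)
{
    x ^= (uint16_t)(x << 7);
    x = (uint16_t)((x << 3) | (x >> 13));  /* rotate left 3 */
    x ^= (uint16_t)(x >> 5);
    return x;
}

/* Not a permutation: x and x|0x8000 collide. */
static uint16_t lossy_map(uint16_t x) { return (uint16_t)(x & 0x7fff); }

/* Verify a 16-bit mapping is a permutation by visiting all 2^16
   inputs and checking that no output repeats. */
static int is_permutation16(uint16_t (*f)(uint16_t))
{
    static uint8_t seen[1 << 16];
    memset(seen, 0, sizeof seen);
    for (uint32_t v = 0; v < (1u << 16); v++) {
        uint16_t y = f((uint16_t)v);
        if (seen[y]) return 0;
        seen[y] = 1;
    }
    return 1;
}
```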

If that selects fewer than all input bits, that would be faster than evaluating all input bits. Doing three mappings like that could depend on up to 89 bits if both directions had a branching of 5. But inputs to outputs have a variable branching: average 5, sure, but some inputs are used by many more than 5 outputs. You could easily get unlucky and need to evaluate all possible settings of over a hundred bits. That's a large theoretical improvement over 2^512 values, but even 2^89 isn't feasible in practice. So, that optimization looks like a dead end.

Perhaps if I calculate the inverse of the permutation, then I'd only have to do two mappings, from the output to be changed back to the inputs back to the outputs? That's up to 21 output bits with a constant branching of 5, which is close to feasible. So far I have no idea how to make that work. I suspect it can't be made to work.

(April 11 2008) fivebalanced.txt is a list of a few thousand of the best-mixing 5-deep balanced functions.

(May 3 2008) Some of the better-mixing 16-bit permutations I've seen are actually two 8-bit permutations followed by a shuffle of all 16 bits. My framework has 64 of the 16-bit chunks to permute, which maps nicely to a 64-bit integer for each of the 16 bits. If I limit the hash to a single 8-bit permutation repeated twice, then I can compute that 16-bit permutation using SSE2 instructions, 128 bits at a time. I'd still have to split it into 16 64-bit integers to shuffle the bits and do the per-bit rotates.

(July 22, 2008) I'm using tree2c.c to convert 8-bit permutations to C code. They vary from 65 to 130 & | ^ ~ operations. For example

u4 truth1[] = {
 0xad54659a, 0x00006996, 0x9648b769, 0x00009669,
 0x987662d9, 0x9696eb28, 0xd54aa659, 0x00006996,};

u4 mask1[] = {
 0x0d6, 0x02b, 0x0d3, 0x03c,
 0x0ab, 0x0d6, 0x0d6, 0x02d,};

void x1(u8 *v) {
  u8 a0 = v[0];
  u8 a1 = v[1];
  u8 a2 = v[2];
  u8 a3 = v[3];
  u8 a4 = v[4];
  u8 a5 = v[5];
  u8 a6 = v[6];
  u8 a7 = v[7];
  u8 a8 = a1 ^ a6;
  u8 a9 = ~a7;
  u8 a10 = a2 & a9;
  u8 a11 = a1 & a7;
  u8 a12 = a11 | a4;
  u8 a13 = ~a12;
  u8 a14 = a10 | a13;
  u8 a15 = a8 & a14;
  u8 a16 = a1 & a6;
  u8 a17 = a4 & a6;
  u8 a18 = a1 | a17;
  u8 a19 = ~a18;
  u8 a20 = a16 | a19;
  u8 a21 = a2 & a7;
  u8 a22 = ~a2;
  u8 a23 = a4 & a22;
  u8 a24 = a21 | a23;
  u8 a25 = a20 & a24;
  u8 a26 = a15 | a25;
  u8 a27 = a0 ^ a5;
  u8 a28 = a1 ^ a3;
  u8 a29 = a27 ^ a28;
  u8 a30 = a0 ^ a4;
  u8 a31 = a30 & a8;
  u8 a32 = a1 | a7;
  u8 a33 = ~a16;
  u8 a34 = a32 & a33;
  u8 a35 = a30 | a34;
  u8 a36 = ~a35;
  u8 a37 = a31 | a36;
  u8 a38 = a2 ^ a5;
  u8 a39 = a3 ^ a4;
  u8 a40 = a38 ^ a39;
  u8 a41 = ~a40;
  u8 a42 = ~a5;
  u8 a43 = a0 & a42;
  u8 a44 = ~a0;
  u8 a45 = a3 & a44;
  u8 a46 = a43 | a45;
  u8 a47 = a1 ^ a7;
  u8 a48 = a46 & a47;
  u8 a49 = a27 | a47;
  u8 a50 = ~a49;
  u8 a51 = a48 | a50;
  u8 a52 = a2 ^ a4;
  u8 a53 = a47 & a52;
  u8 a54 = a6 & a9;
  u8 a55 = a11 | a54;
  u8 a56 = ~a52;
  u8 a57 = a55 & a56;
  u8 a58 = a53 | a57;
  u8 a59 = ~a4;
  u8 a60 = a2 & a59;
  u8 a61 = a1 & a4;
  u8 a62 = ~a61;
  u8 a63 = a7 & a62;
  u8 a64 = a60 | a63;
  u8 a65 = a8 & a64;
  u8 a66 = a2 | a7;
  u8 a67 = a2 & a4;
  u8 a68 = ~a67;
  u8 a69 = a66 & a68;
  u8 a70 = a8 | a69;
  u8 a71 = ~a70;
  u8 a72 = a65 | a71;
  u8 a73 = a2 ^ a3;
  u8 a74 = a27 ^ a73;
  v[0] = a26;
  v[1] = a29;
  v[2] = a37;
  v[3] = a41;
  v[4] = a51;
  v[5] = a58;
  v[6] = a72;
  v[7] = a74;
}
The final C code is likely to have some hand-optimizations, and will be in SSE so it can handle 128 bits at a time rather than 64. Still, it's not a bad approximation to use the timings of these chunks to compare various functions. And I can construct full 1024-bit hashes out of these chunks, then see how many iterations each function needs for avalanche, and time full 1024-bit hashes. I've got 3869 such functions at the moment.

(July 24, 2008) I spent a day scanning papers looking for the standard way to associate a key with a permutation to form a block cipher. Near as I can tell, there isn't one. k0^f(k1^block) was proposed for DESX, but that requires twice the ideal amount of key, and doesn't play well with real f()'s that do some XORing of their own.

Should I trickle in data, or XOR 1024-bit blocks all at once? Trickling has the advantage that there are parts of the input you can't overwrite. Big blocks have the advantage that you can do more work between writes. Attacks often involve, oh, 8-bit deltas. If you only overwrite half the block, there's a 2^-8 = 1/256 chance that an interesting 8-bit delta will be entirely in the half you're overwriting. The security gain from doubling the number of rounds is much more than a factor of 256. So big blocks win.

I'm building a hash. It doesn't have to permute the input blocks. I could do m0 f k f m0 f m1 f k f m1 f m2 f m2. It loses about a bit of entropy from m0, but that's OK, I've got a 1024 bit state and I only report at most 512 bits of it. I could do m1 f k f m0 f m2 f k f m1 f m3 f k f m2 f ..., that puts 2x the work between the two instances of a block instead of f. If there's an attack involving m1 and m2, then m1 and m2 get repeated twice here, it's symmetric, the same attack might hold. So I could do m1 f k f m0' f m2 f k f m1' f m3 f k f m2' f ..., where x' is x with the top and bottom half swapped. Since the top and bottom half are hashed entirely differently (they feed into different bits of the 16-bit permutation), any common useful delta between m1,m2 and m1',m2' will be coincidence. If I've got avalanche in n rounds, I reckon I need about 4n rounds for cryptographic strength. I don't have to do m1 m0' m2 m1' m3 m2', I could do m3 m0' m4 m1' m5 m2' m6 m3', or whatever I need to put the two m1's 4n rounds apart. I think that leaves a flaw in (m2,m3), and an unrelated flaw in (m2',m3'), as the worst attack. I think that has about the same probability as a flaw in (m2,m3) with twice as many rounds. That suggests m2,m3 should be 2n rounds apart, yielding m2 f k f m0' f m3 f k f m1' f m4 f m2' f, where f is 5 rounds. A round is 5 nands deep, an XOR is 3 nands, 3 is just barely negligible compared to 5*5, this looks about right. That's 84 gates deep per 1024 bits, hey, not bad. 69 gates if I can do avalanche in 6 rounds. 4*1024 bits = 512 bytes of storage.

(July 24 2008) Let's take that a little more to the extreme.

    f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f
   k  k  k  k  k  k  k  k  k  k  k  k  k  k  k  k  k  k
   1  2  1  2  1  2 13 14 13 14 13 14 13 14 13 14 13 14
   3  4  3  4  3  4  3  4 15 16 15 16 15 16 15 16 15 16
   5  6  5  6  5  6  5  6  5  6 17 18 17 18 17 18 17 18
   7  8  7  8  7  8  7  8  7  8  7  8 19 20 19 20 19 20
  -3 -2  9 10  9 10  9 10  9 10  9 10  9 10 21 22 21 22
  -1  0 -1  0 11 12 11 12 11 12 11 12 11 12 11 12 23 24
Repeat each block 6 times. XOR all the (rotated) blocks that line up vertically below, then XOR that with the state. Each repeat of a block should be rotated by a different amount, probably (0,1,3,6,10,15), so that XORing with other blocks cancels out different information each time. (I need to confirm with Gaussian elimination.) Suppose 24 f's is enough for security. There are 6 (7,8) gaps, that requires 4 f's between the XORs. 4 f's * 6 gaps = 24 f's total in the gaps for (7,8) pairs. There are also 5 (8,7) gaps, plus a leading 7 and trailing 8, that's 4+5*4+4=28 f's, that's fine. There are 6 8's, so 5 8-gaps, with 8 f's between each, 8*5=40 f's total, that's fine. XORing everything vertically can be done in parallel with the f's, so the XOR with the internal state to combine all these blocks is only 3 NANDs deep. 5 NANDS per f, 4*5+3=23 gates deep per 1024-bit block. There are 13 1024-bit blocks of state, 13*128 bytes = 1664 bytes to keep in memory.

I'm back to testing whole permutations. Here's a random one, nisthash5.c, similar to nisthash[1..4].c. At the moment it looks like I need 7 or 8 f's for avalanche, not 6. So far these nand-trees are looking worse than the old Feistel networks.

(July 25 2008) I switched from combining every block 6 times to combining every block 10 times, and using 3 rounds between combines instead of 5. I've got a complete hash now, nisthash6.c. It can, like, hash strings and everything. This runs in 12 cycles per byte in visual studio C++ on an Intel quad core (about 4 for combining blocks and 8 for mixing).

(July 26 2008) I optimized that a bit this morning (it started at 22 cycles per byte, then 18, then 13, got it to 12 this morning and updated nisthash6.c). Next up is looking for characteristics? No, not yet. Next up is looking into SSE and what is required to use it. I see it requires 16 byte alignment, and likes to load and store 128 consecutive bits. I've got to rearrange my two 8-bit permutations so that the 64 low-0 bits and 64 high-0 bits are consecutive, so SSE can do all 128 permutations at once. I may be able to use SSE in combining the data as well, but that may require 8 distinct shift amounts instead of 10 (is that OK?).

The key and message blocks are all 16 64-bit values. If you XOR a 64-bit constant Q to all of them, and there's exactly two message blocks, you'll get the same hash because the key and message blocks are XORed (rotated by multiples of 64 bits) and the Q will cancel out. Easily solved: make the key 512 bits instead of 1024, and treat the high 512 bits of the key as zero.
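
The cancellation is easy to demonstrate. A sketch with an illustrative lane-rotation combine (not nisthash's exact combine; the rotation amount and lane values here are arbitrary):

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 16 };  /* 16 64-bit values = 1024 bits */

/* Combine a message block with a key block: XOR lanes, with the key
   rotated by a whole number of 64-bit lanes.  Because the rotation is
   a multiple of 64 bits, a constant Q XORed into every lane of both
   blocks cancels out. */
static void combine(const uint64_t m[LANES], const uint64_t k[LANES],
                    int rot, uint64_t out[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = m[i] ^ k[(i + rot) % LANES];
}
```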

Related. Suppose you have blocks A,B,C,D and you xor Q to all their 64-bit quantities. You'll get the first (A,B) gap and the final (C,D) gap, but in between A^C and B^D will cancel out Q. That's a total of 6 rounds to find differentials for (well short of 30). The root of this problem is that linearity in the combine allows differentials to be shuffled among blocks, probably using f(B) as well as B. That f counts towards the 30 rounds, but only once, no matter how many times f(B) is used.

(July 27 2008) OK, the game with block scheduling is for any set of blocks, you draw line segments connecting combines that use those blocks. Segments are allowed to pass through combines with one or more block. A combine with two or more blocks can be treated as a line segment of zero length. This tells me how many rounds of f need differentials found for them. The first and last block will always contribute 1 gap each, and using f(B) as well as B adds the equivalent of 1 gap every 3 blocks, so only (10-2)*3=24 blocks at most can be a threat. I need to confirm every subset of 24 or fewer blocks requires at least 10-#blocks/3 line segments.

According to spacing.c, which brute-force tries all sets of up to 8 blocks, I can accomplish that by combining block i at offsets i, i+12-5(i%3), i+28-5((i+2)%3), i+31. I'll try to brute-force 9 blocks as well, but I don't have the resources to do more than that. Can I skip f(B)? No, that's what makes this problem finite. Otherwise you apply linear algebra and you'll find some very complicated way to have only 6 rounds to deal with.

Tracking 32+2 128-byte blocks uses just over 4K of RAM. Using f(B) as well as B requires an extra f() per block, making it 4 instead of 3 applications of f(). That will slow down the hash some in software, and require a duplicate implementation of f() in hardware. On the other hand, I'm only using each block 4x, which reduces the combining work, and eliminates my worry about combining blocks 10x but only having 8 128-bit rotates that can be done with SSE.

I've revised the hash to handle all that, nisthash7.c, but it's still not using SSE. 22.7 cycles/byte with 32-bit gcc, 11.3 cycles/byte on 64-bit Vista with visual studio. (Whoa, it got faster, perhaps I did something stupid and it's optimizing away all the combines?).

(July 31 2008) I tried using SSE for the mix and the combine. That brings it down to 7.6 cycles/byte in Visual Studio 2008 on my Vista quad core (nisthash82.c), and 10.9 cycles per byte with gcc on my 32-bit dual core XP machine (nisthash8.c). Unfortunately nisthash8.c gives different results via gcc and visual studio, and nisthash82.c only compiles under visual studio (and gives a third set of results). I'm still not doing anything but plain C for the 64-bit rotates. I think I should look at correctness next.

(August 2 2008) After fixing some bugs, nisthash9.c is back up to 7.7 cycles/byte under visual studio on my Vista quad core, and 8.4 cycles/byte under gcc on my XP dual core. I switched the permutation because the previous 8-bit permutation had a deterministic delta. I switched the key schedule too, I had a bug in spacing.c. It gives the same results on both platforms now. Now I'm using i, i+23-6*(i%4), i+26-6*((i+2)%4), i+31, which requires at least 10 gaps for up to 9 variables.
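
The revised schedule formula transcribes directly; the gap-counting verification that spacing.c does is not reproduced here, just the offsets and a few basic properties:

```c
#include <assert.h>

/* The four combine offsets for block i under the August 8 schedule:
   i, i+23-6*(i%4), i+26-6*((i+2)%4), i+31. */
static void schedule(int i, int out[4])
{
    out[0] = i;
    out[1] = i + 23 - 6 * (i % 4);
    out[2] = i + 26 - 6 * ((i + 2) % 4);
    out[3] = i + 31;
}
```

The two middle offsets fall at i + {5,11,17,23} and i + {8,14,20,26} depending on i%4, so the four uses of a block are always distinct and span exactly 31 combines.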

(August 4 2008) nisthash10.c. I switched the permutation and 16-bit shuffle and shifts again. 7.6 cycles/byte visual studio, 7.9 cycles/byte gcc. The 16-bit mix 3x gives 0.375 avalanche. Due to XOR, there are fixed points for individual 16-bit chunks, but each chunk has different ones. At the 1024-bit level, every bit is affected by 8 rounds forward or 7 rounds backwards (0.27, 0.29), with nearly-zero and nearly-one base states, and 7 rounds forward or 6 rounds backwards (0.378, 0.17) with random base states.

(August 6 2008) nisthash11.c. I'm thinking about what happens if you have a message that is just block B repeated over and over. After 31 combines, you'll be combining B^B^f(B)^f(B)^key every round, which is just the key. With a zero key that's no combine at all. If you change the message by adding more copies of B, those intermediate states are just the effect of 3 rounds on the internal state. But that's after 31 combines, which is 93 rounds. If you try to change the key to control this, that's 93 rounds where the key will be scrambling the internal state, so I don't see how to exploit this. Still, I think I'll change it to B^f(B)^f(B)^f(B)=B^f(B) instead of B^B^f(B)^f(B)=0. Those instances of the blocks were actually rotated by different amounts, but still.

(August 8 2008) nisthash12.c. Fixed bugs in the block scheduling. Shifted length by 3 when combining so it can handle a length in bits.

I ran my characteristic-finding code for one, two, three, four, five iterations of perm(), and it didn't find anything. I only did 50000 trials with 10000 tests of each possible characteristic, and the code is easily confused anyhow. I looked for deterministic deltas for one round. There aren't any, other than the zero delta. I tried all two-bit deltas with random bases, they do avalanche in 7 rounds forward and 6 rounds backward too.

Assuming the block scheduling works so deltas must make it through 30 rounds, then a delta with probability 3/4 per round (there are such deltas for single rounds, but they can't be chained) yields a 30-round output delta of 2^-12 probability. If I could show no chain of deltas gives more than 1/2 probability per round, that's 2^-30 probability. I need to show no chain of deltas gives 2^-17 probability per round to reach 2^-512 probability for the full 30 rounds. It seems unlikely that I could show that. I don't think I can enumerate all deltas with 2^-17 probability for a single round. But if the average input delta in a chunk has no output delta with probability greater than 1/2, that's 2^-64 per round for the whole state, so 8 rounds would have no output delta with probability 2^-512, so 30 rounds would be easily enough for non-special input deltas. How likely is the most likely output delta per chunk per input delta?

(August 9 2008) Since the 16-bit chunks are actually two copies of an 8-bit permutation, with 3 bits possibly flipped, it's the behavior of those 256 deltas for those 2^3 8-bit permutations that matters. Everything else is combinations of those. Only zero-deltas are deterministic; next come 24 .75 probability deltas, 16 176/256 deltas, 8 144/256, and 32 128/256. The strongest input delta you can get to chain 3 rounds looks about 1/3 probability per round. The average output delta per 8-bit permutation has probability about 36/256, so the average output delta after one round has probability about 2^-362.
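
These per-permutation delta counts come from a difference distribution table. A sketch with a made-up 8-bit permutation (x*(2x+1) mod 256 is a known permutation polynomial; this toy even has its own deterministic delta at input delta 0x80, the kind of flaw that forced the permutation switch above):

```c
#include <assert.h>
#include <stdint.h>

/* Made-up reversible 8-bit map, NOT one of the hash's NAND-tree
   permutations: a permutation polynomial, a rotate, and an XOR. */
static uint8_t toy_perm(uint8_t x)
{
    x = (uint8_t)(x * (2 * x + 1));      /* x*(2x+1) mod 256 */
    x = (uint8_t)((x << 3) | (x >> 5));  /* rotate left 3 */
    return (uint8_t)(x ^ 0x5a);
}

/* Highest count in the din row of the difference distribution table:
   how many x have p(x) ^ p(x^din) equal to the most popular output
   delta.  Probability of that delta is count/256. */
static int max_delta_count(uint8_t (*p)(uint8_t), uint8_t din)
{
    int count[256] = {0}, best = 0;
    for (int x = 0; x < 256; x++)
        count[p((uint8_t)x) ^ p((uint8_t)(x ^ din))]++;
    for (int d = 0; d < 256; d++)
        if (count[d] > best) best = count[d];
    return best;
}
```

Every count is even because x and x^din contribute the same output delta in pairs.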

Perhaps I should pick a configuration that guarantees a minimum of 64 rounds, which would only require me to verify that no delta chains achieve 2^-8 per round, and that might be feasible to verify. It'd be kind of sad, though, making it bigger and slower if it didn't really need it. I had spacing.c search today for a way to guarantee 16 gaps (with 4 rounds each), but it didn't find one that used under 70 blocks.

August 10 2008 Enough, time to write this up. Me being me, one of the biggest risks is that I'll be disqualified for not following the submission directions properly. I can't follow directions. So I need to submit it by Aug 31, so they'll tell me if I didn't follow the directions right. nisthash13.c: I changed it to B^B^B^f(B) instead of B^f(B)^f(B)^f(B). That way, if you have random access to the key and a 4K sliding window of the data, you only need 256 bytes or so on chip at a time, and you only have to compute f(B) once per block.

August 12 2008 nisthash14.c. I initialize the accumulators with the key, and at the end XOR them with the length (the length goes into a[i][i%LEN] for the ith accumulator), and I omit combining any fake leading and trailing blocks. This avoids all combines and f(B) calculations when finishing up the hash, and it allows streaming inputs where the length is not known until the end of the input is seen. The offsets for the length prevent a length-extension attack where you use len+1, key-1. Hashing a null string takes 18400 cycles, which is much slower than encrypting the null string. Hashing a 1024-byte buffer takes 25400 cycles. I also made the final hash the whole 1024-bit state; I can't prove that XORing the high and low halves together is necessary.

August 13 2008 Rather than XORing a bunch of blocks together, applying a round to one of the uses of each block, and XORing that to the state every three rounds, how about doing the XOR thing, then going two rounds, then XORing just one block to the state, then doing another round? That way I'd do 3 rounds of work per block rather than 4. Great! I'm searching for a block schedule that passes that. Also, my billion-byte timings are actually 2^14 timings of hashing 2^16 bytes, not a single hash of 2^30 bytes. Gotta fix that.

August 14 2008 nisthash15.c I changed the block scheduling to 0(+1 rounds), 26, 28, 38. That means 3 rounds of work per block instead of 4 rounds of work per block, bringing the time down to about 5.8 cycles/byte. I changed the use of the key and length so that the first block is the key and the last block is the key XORed with the length. This makes hashing an empty string even slower: 120 rounds instead of the previous 93.

August 17 2008 nisthash16.c Bah, more bugs in block scheduling. I replaced spacing.c with a version with the known bugs fixed. Now it keeps 47 128-byte accumulators around. I split the hash into init(), add(), final(), so now I can do a true 1-billion-byte hash by feeding a single 64K buffer 2^14 times. 5.1 cycles/byte, 23661 cycles to hash an empty string. A little slower for unaligned inputs. The internal state is 6672 bytes.

August 19 2008 nisthash17.c I revised the key and length scheduling. Now the key is variable length, but a multiple of 16 bytes. The key is hashed, then the data is hashed (zero-padded to a byte boundary), then the key is hashed again, then 2 bytes of 16*keylen+2*padding+1 are added (both lengths measured in bits), then the final block is zero-padded and hashed. reference.c and reference.h also implement it, using the NIST APIs. They're slower, because they don't use SSE or unroll loops to precompute constants, and slower yet on 32-bit platforms, where the compiler implements 8-byte ints with 4-byte operations. NIST required the length to be given in bits. There are also optimized.c and optimized.h, which implement the NIST APIs and are the same speed as nisthash17.c. They all compute the same hash. 5.3 cycles/byte, 23077 cycles to hash the empty string.

August 22 2008 The submission is ready. The hash will be named Maraca. I'll mail it off tomorrow morning. Here's the algorithm specification.

October 5 2008 NIST responded with a bunch of deficiencies. Most of them are checkbox sorts of things, but some are tricky. The worst is insufficient explanation of constants (i.e. you can't deterministically derive the constants), which I can't fix. It's a big space. My exploration of it depended on several months of tweaking programs, on exactly when blackouts stopped search programs, on bugs in helper programs, and so forth. Another serious deficiency was lack of cryptanalysis. I knew about that one. Again, it's a big space; I can't prove much of anything about what can't be done. But I can do a trial cryptanalysis, so I'm doing that.

In the process I noticed my shift constants don't obey the rules I said they did. I said they were i+8*j, where all i in 0..7 and j in 0..7 are covered twice. That wasn't true, and there were two duplicate shift amounts (and a shift by zero) roughly because of that. So I'm replacing the shift amounts, which changes the hash function. I'll have to redo the tests and submit the new function.

October 21 2008 One of the deficiencies cited was that I hadn't addressed smaller message digests directly. Looking at that, I realized that since the last use of the last block isn't mixed at all, a delta in only the high bits won't affect a truncated result. Reanalyzing the pattern with the last use of the last block omitted, it guarantees only 24 permutations, and that can be hit by deltas in blocks 2, 3, 7. Worse, every block use in the last 30 permutations is suspect of not fully affecting the result. I'll fix that by doing 30 permutations of strengthening before reporting the result. It's tempting to XOR the key at the end too, so those 30 permutations can't be rolled back, but that would require analyzing the key differently from any other data, so I didn't do it. Other ciphers appear to have some final strengthening as well, so this hack is in good company.

I've found replacement constants for map[] and shift[]. They obey the rules and they give better measures than the old constants.

October 22 2008 I'm resubmitting it: nist.

This is how to use a permutation to build a hash function. The formula requires p permutations of work per data block. The formula guarantees that all deltas must pass through at least d permutations. The formula requires b data blocks to be kept in memory at once, and requires each block to be used u times. In all cases, the +1 for every use but the first guarantees that a delta involving q blocks will require at least q permutations, so the guarantee is really max(d,q). Replacing each permutation with n permutations multiplies the guarantee by n. There may be some other guarantee that would be more useful than these, but by doing some overkill with these, any other useful guarantee could be met as well. Or similar searches could come up with patterns that satisfy other guarantees directly. Which guarantee is useful depends on the use and the permutation. Most of these I searched up to 10 out of 40 blocks, and up to 7 out of 80 blocks; the formulas may fail beyond that.

d p u b formula date
1 1 1 1 i Sept 4 2008
1 2 2 1 2i, 2i+1 Sept 3 2008
3 2 2 2 2i, 2(i+1)+1 Sept 3 2008
4 2 2 3 2i, 2(i+2)+1 Sept 3 2008
5 2 2 7 2i, 2(i+6-4(i%2))+1 Sept 3 2008
6 2 2 12 2i, 2(i+11-4(i%2))+1 Sept 3 2008
6 2 3 6 2i, 2(i+4-4(i%2))+1, 2(i+6)+1 Sept 3 2008
7 2 2 20 2i, 2(i+19-4(i%4))+1 Sept 3 2008
7 2 3 8 2i, 2(i+5-4(i%2))+1, 2(i+7)+1 Sept 3 2008
8 2 2 27 2i, 2(i+26-4(i%4))+1 Sept 3 2008
8 2 3 9 2i, 2(i+6-4(i%2))+1, 2(i+8)+1 Sept 3 2008
9 2 3 10 2i, 2(i+5-4(i%2))+1, 2(i+9)+1 Sept 3 2008
10 2 3 11 2i, 2(i+6-4(i%2))+1, 2(i+10)+1 Sept 3 2008
11 2 3 20 2i, 2(i+2+4(i%4))+1, 2(i+19)+1 Sept 3 2008
12 2 3 21 2i, 2(i+4+4(i%4))+1, 2(i+20)+1 Sept 3 2008
13 2 3 28 2i, 2(i+14-4(i%4))+1, 2(i+27-4(i%2))+1 Sept 3 2008
14 2 3 33 2i, 2(i+14-4(i%4))+1, 2(i+32-4((i+2)%4))+1 Sept 3 2008
15 2 4 21 2i, 2(i+3)+1, 2(i+13-8(i%2))+1, 2(i+20)+1 Sept 6 2008
15 2 4 26 2i, 2(i+22-6(i%4))+1, 2(i+23-6((i+2)%4))+1, 2(i+25)+1 Aug 19 2008
16 2 3 40 2i, 2(i+14-4(i%4))+1, 2(i+39-8((i+1)%2))+1 Sept 6 2008
16 2 4 24 2i, 2(i+14-8(i%2))+1, 2(i+4)+1, 2(i+23)+1 Sept 6 2008
17 2 4 29 2i, 2(i+14-4(3i%4))+1, 2(i+22-8(i%2))+1, 2(i+28)+1 Sept 6 2008
18 2 4 32 2i, 2(i+10-8(i%2))+1, 2(i+27-4(3i%4))+1, 2(i+31)+1 Sept 6 2008
19 2 4 34 2i, 2(i+11-8(i%2))+1, 2(i+27-4(i%4))+1, 2(i+33)+1 Sept 6 2008
20 2 4 37 2i, 2(i+17-4(3i%4))+1, 2(i+32-4(i%4))+1, 2(i+36)+1 Sept 11 2008
20 2 4 39 2i, 2(i+21-6(3i%4))+1, 2(i+34-4((i+1)%4))+1, 2(i+38)+1 Sept 11 2008
21 2 4 42 2i, 2(i+11-8(i%2))+1, 2(i+35-4(3i%4))+1, 2(i+41)+1 Sept 11 2008
21 2 4 43 2i, 2(i+19-12(i%2))+1, 2(i+38-4((3i+1)%4))+1, 2(i+42)+1 Sept 11 2008
22 2 4 46 2i, 2(i+11-8(i%2))+1, 2(i+29-4(3i%4))+1, 2(i+45)+1 Sept 11 2008
23 2 4 48 2i, 2(i+13-8(i%2))+1, 2(i+39-4(i%4))+1, 2(i+47)+1 Sept 11 2008
23 2 4 49 2i, 2(i+19-12(i%2))+1, 2(i+45-4((3i+2)%4))+1, 2(i+48)+1 Sept 11 2008
24 2 4 53 2i, 2(i+19-12(i%2))+1, 2(i+45-4((i+1)%4))+1, 2(i+52)+1 Sept 11 2008
24 2 4 53 2i, 2(i+16-4(3i%4))+1, 2(i+41-8(i%2))+1, 2(i+52)+1 Sept 11 2008
25 2 4 61 2i, 2(i+22-6(i%4))+1, 2(i+48-8(i%2))+1, 2(i+61)+1 Sept 20 2008
26 2 4 62 2i, 2(i+22-6(i%4))+1, 2(i+49-8(i%2))+1, 2(i+62)+1 Sept 20 2008
27 2 4 63 2i, 2(i+20-16(i%2))+1, 2(i+51-6(3i%4))+1, 2(i+63)+1 Sept 20 2008
30 3 4 47 3i, 3(i+21-6(i%4))+1, 3(i+41-6((i+2)%4))+1, 3(i+46)+1 Aug 19 2008
