**Maraca has been submitted to NIST, rejected by NIST, and later
utterly broken by Sebastiaan Indesteege. A message hashing to an
arbitrarily chosen hash value can be derived in about two seconds of
computing time.**

NIST is having a competition to define a new cryptographic hash to replace SHA-2 (they'll dub it SHA-3). I think I'll enter, if I can come up with something worthwhile. This page is a work-in-progress of the hash I'm currently considering. I'm weak on hardware and hash cryptanalysis, so I expect I'll be revising these hashes considerably as I learn more.

**(Feb 10 2008)**
My first attempt, nandhash.c, isn't
crypto at all; it just attempts an efficient hardware hash.
It hashed 128 bits arranged in eight 4x4 blocks, with alternating
4x4 bit mixes and XORs between bits in neighboring blocks. The 4x4
bit mix consisted of an LFSR of 8 of the bits (3 or 4 XORs per bit),
and (~s)^((w&x)|(y&z)) for the remaining 8 bits (for some
w,x,y,z chosen from the preceding bits). In software, the eight 4x4
blocks can be mixed simultaneously by storing bit i of every block in
the ith of 16 8-bit words, then manipulating the words.
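As a sketch of that bit-slicing idea, the nonlinear half of the 4x4 mix can be computed for all eight blocks at once. The particular bit positions fed to s,w,x,y,z below are illustrative, not the assignment nandhash.c actually used:

```c
#include <assert.h>
#include <stdint.h>

/* Bit-sliced evaluation of the nonlinear step (~s)^((w&x)|(y&z)).
   state[i] holds bit i of all eight 4x4 blocks, one block per bit of
   the 8-bit word, so one call mixes all eight blocks at once.
   Feeding s from bit 8 and w,x,y,z from bits 0..3 is an illustrative
   choice, not necessarily what nandhash.c did. */
static uint8_t nonlinear_step(const uint8_t state[16])
{
    uint8_t s = state[8];
    uint8_t w = state[0], x = state[1], y = state[2], z = state[3];
    return (uint8_t)(~s ^ ((w & x) | (y & z)));
}
```

Each of the 16 words is touched once per step, so the same code scales to 64 blocks by widening the words from 8 to 64 bits.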

NIST wants at least a 512-bit hash value. NIST also wants the
ability to parameterize the hash with a 512-bit secret key. That
dictates at least a 512-bit state and 512-bit key. I'm guessing I'll
do this with a 1024-bit state, with a 1024-bit mix, where I XOR 512
bits of message to one side and 512 bits of key to the other side,
then mix, with some setup and wrapup procedure. I expect to hash
2^{20} byte subsequences, then hash the 1024-bit states of
the subsequences if messages are longer than that, so that the hash
can make use of massive parallelism for large streams. Or perhaps I
should only XOR 128 bits of message and key at a time, but 4x as
often?

One attack on crypto hashes is to find an intermediate state x that produces pattern y when run in reverse and pattern z when run forward. Then if you start with pattern y and run it forward you'll reach x, and running it forward more you'll get z. So if it takes m rounds backwards to achieve avalanche, and n rounds forward, then anything less than m+n rounds is clearly not good enough for a crypto hash. The right number of rounds is probably double or triple that, but m+n is the largest number of rounds I can easily show is no good. It's a good metric for comparing mixing functions.

**(Feb 11 2008)** The next hash,
nisthash1.c, was the
same 4x4 mix, but there were 8x8 4x4 blocks (1024 bits in all). The
4x4 mix is now done in software with 64-bit words instead of 8-bit
words, since there are now 64 4x4 blocks instead of 8 4x4 blocks.
Horizontal and vertical mixing XORed all 8 corresponding horizontal and
vertical bits, then XORed that again to the original bit. I tested
one-bit diffs against random bases. It took 5 rounds forward and 5
rounds backward to achieve avalanche. The 4x4 mix is 6 nands deep,
and the XOR is 9 nands deep, so (5+5)*(6+9)=150 nands deep minimum.

Nisthash1.c did awfully if the initial state was mostly zero or
mostly one. It also had symmetry between all the 4x4 blocks, which
was very bad *forever* when the initial state was symmetric.

**(Feb 18 2008)** Next came the imaginatively named nisthash2.c.
This broke symmetry by
complementing s in (~s)^((w&x)|(y&z)) only half the time.
In software that was accomplished by XORing a constant to s. I chose
"half the time" in a very regular way that guaranteed each 4x4
mix differed from every other one in at least 2 ways. Randomly chosen
constants worked better than that very regular way. But NIST wants
the design decisions to be fully explainable. I expect I'll
eventually resort to randomly chosen constants, randomly generated by
my small state prng with some
seed. I've been using that a lot for hillclimbing. Also in this
nisthash2.c, I allowed w,x,y,z to be optionally complemented. I
tested this with almost-all-zero and almost-all-one states, as well as
with random states. It took 6 rounds backwards and 6 rounds forwards
for avalanche, so (6+6)*(6+9) = 180 nands deep minimum.

**(Feb 19 2008)** The next, nisthash3.c,
was basically the
same as nisthash2.c, but it had a slightly better mixing function. I
noticed that although I allowed the top eight bits to nonlinear mix
any bit before them, the best functions usually used only the first
eight bits, so in nisthash3.c I added that restriction, w,x,y,z had to
come from the first eight bits. About the same time I noticed a bug in
my hillclimber that prevented it from using the previous bit, which
might also explain why the first eight bits were favored. Still
(6+6)*(6+9) = 180 nands deep minimum.

**(Feb 22 2008)**
Several people have told me that wire length doesn't matter much,
it's nand depth I should pay attention to. I don't have a hardware
simulator to tell me one way or the other. But if wire length is
irrelevant, perhaps I should eliminate the XORs in horizontal and
vertical mixing. nisthash4.c
does that. I
also realized that the s side of s^((w&x)|(y&z)) isn't many
nands deep, so I could get more mixing in by doing
(s^t)^((w&x)|(y&z)). So I do an LFSR with 1 or 2 taps for the
top 8 bits, as well as an LFSR with 3 or 4 taps for the bottom 8
bits. To keep this reversible I could only use w,x,y,z from the
bottom 8 bits. 6 rounds backwards or 8 rounds forward for avalanche,
so (6+8)*(6) = 84 nands deep minimum. Unfortunately this got rid of
my symmetry-breaking, and the hash is horribly symmetric again.

Is 84 nands deep good? What's the best possible? XOR is 3 nands deep, and you need a 10-level tree to combine 1024 bits, so XOR trees forward and backwards would be (10+10)*3 = 60 nands deep. At 3 nands deep, XOR could be combining 8 variables, but it's only combining 2. If you look at (s^t)^~(~(w&x)&~(y&z)), the XORs are giving you avalanche, and the rest is giving you nonlinearity. It's 6 nands deep. Is there some sort of permutation that could give you nonlinearity and avalanche simultaneously that's less than 6 nands deep? I don't know. I don't think anybody has checked.
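For concreteness, here is the depth-3 NAND tree for 2-input XOR that the "3 nands deep" figure refers to (the standard construction, sketched as a quick check):

```c
#include <assert.h>

/* 2-input NAND restricted to one bit. */
static unsigned nand(unsigned a, unsigned b) { return ~(a & b) & 1u; }

/* a^b built from 2-input NANDs, three levels deep:
   level 1: m = !(a&b); levels 2-3: !( !(a&m) & !(b&m) ). */
static unsigned xor_nand(unsigned a, unsigned b)
{
    unsigned m = nand(a, b);             /* level 1 */
    return nand(nand(a, m), nand(b, m)); /* levels 2 and 3 */
}
```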

I wrote a program, balancenand.c (March 7 2008), to come up with all balanced binary functions that are n nands deep or less. After it generates the truth tables, I run sort and uniq on them. There are 116 distinct truth-tables for balanced binary functions 3 nands deep or less (none uses more than 4 variables), and 2,094,862 distinct truth-tables for balanced binary functions 4 nands deep or less using 6 variables or less. None of them do avalanche quite as well as XOR, but some of the 5-variable functions come close. 4 nands deep, 7 variables or less, has 26,771,332 distinct truth tables.
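The "balanced" condition itself is just a popcount test on the truth table; a minimal sketch (my own code, not balancenand.c's), assuming the table fits in 64 bits:

```c
#include <assert.h>
#include <stdint.h>

/* A function is balanced iff its truth table has as many ones as
   zeros, i.e. exactly half of the 2^nvars entries are 1. */
static int is_balanced(uint64_t table, int nvars)
{
    int ones = 0;
    for (int v = 0; v < (1 << nvars); v++)
        ones += (int)((table >> v) & 1);
    return ones == (1 << (nvars - 1));
}
```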

**(March 13 2008)**
Another program, nandperm.c,
reads in a file of truth tables and randomly hillclimbs trying
to build a permutation out of them. It picks one bit to replace,
checks if a new balanced function can replace it, and if it can it also
checks if the new balanced function can extend the permutation without
replacing it. Once a whole permutation is made, it continues, replacing
one bit function at a time, walking through many related permutations.
I don't know if the graph of permutations is connected that way.

**(March 17 2008)** For example, here is its report for a 4-nand-deep
permutation of 16 bits:

```
4  0, 1, 5,14     1111000100011100
4  2, 4, 6, 8     1010101001010110
4  1, 4, 6, 7     1111000000011110
4  2, 4, 6, 8     1111000100001110
5  0, 1, 5,12,14  10001001100011111001100110011001
5  5, 9,10,11,15  00000000111111110111011100001010
4  0, 1, 4, 5     1111000000011110
4  0, 1, 3, 6     1100001111010010
5  5, 9,10,11,15  00001111000011110101001011110010
5  5, 9,10,11,15  00110011001100110100110001101110
5  5, 9,12,13,15  11110000000010111111010100001011
4  0, 5,12,14     1010010110110010
5  0, 1, 5,12,14  00011100000111000001111011011110
4  0, 3, 4, 7     1010101101010100
5  5, 9,12,13,15  11110001111101010000101000001011
5  0, 1, 5,12,14  11111111000000000001010110101011
```

**(March 15 2008)** It turns out 3-way XOR can be implemented in a tree
only 5 NANDs deep. Does this generalize to N-way XOR being
2*ceil(log_{2}(N))+1 NANDs deep, rather than
3*ceil(log_{2}(N)) NANDs deep? If it does, does this require an
exponential number of gates? I'm not sure yet. I wish I
knew whether 2-way NANDs or 2-way NORs or 3-way NANDs or 4-way NORs or
whatever were the standard component for circuits.

**(March 16 2008)** To measure the avalanche of a 0/1 function, for
every input value, flip each input bit and add to a total if the
result changes, then divide the total by the number of input values.
The maximum avalanche for an n-bit 0/1 function is n, and it is
achieved by the XOR of those n bits. Consider a function !(x&y),
where x and y are themselves 0/1 functions. What can we say about x
and y if !(x&y) is balanced and has avalanche > Z? Some zeros
come from x, some from y, some from both x and y. If x has (.5-q)
zeros, then y must contribute exactly q zeros that aren't shared by
x (note q is a fraction of the total, not a raw count). The avalanche
of !(x&y) is at most the sum of the avalanche of x and the avalanche
of y. And this is achieved only if x contributes all its zeros and y
also contributes all its zeros, which is only possible when the zeros
of x and y sum to .5. Suppose x is .5-q zeros but y is r zeros, where
r > q. It's reasonable to expect at most q/r of the avalanche from
y to be contributed to !(x&y), total avalanche probably under
(avalanche x) + (q/r)(avalanche y). The avalanche in y might be
concentrated in a subset of the zeros, so we can do better, but if
we're picking functions that have unusually high avalanche, this is
unlikely. If the input bits that x and y depend on don't overlap, for
!(x&y) to be balanced, we need (ones in x)*(ones in y) = .5, so
both are sqrt(.5) on average. Similarly avalanche would be
sqrt(.5)(avalanche in x + avalanche in y).
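The avalanche measure defined above is mechanical to compute from a truth table; a minimal sketch (my own code, not balancenand.c's):

```c
#include <assert.h>
#include <stdint.h>

/* Avalanche of an n-input 0/1 function given as a truth table:
   for every input value, flip each input bit and count how often
   the output changes; divide by the number of input values. */
static double avalanche(uint32_t table, int n)
{
    int total = 0;
    for (int v = 0; v < (1 << n); v++)
        for (int b = 0; b < n; b++) {
            int out  = (int)((table >> v) & 1);
            int out2 = (int)((table >> (v ^ (1 << b))) & 1);
            if (out != out2) total++;
        }
    return (double)total / (double)(1 << n);
}
```

For the 4-input XOR (truth table 0x6996) this gives the maximum of 4.0, since every bit flip changes the parity.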

**(March 22 2008)** 2^{n}-way XOR can be implemented only 2n+1
NANDs deep. The 1 is to turn all inputs into themselves and their
complements: (a,!a) is implemented as (a, !(a&a)). Given
(a,A),(b,B), where a and b are each the XOR of n variables and A,B are
their complements, ((a^b),!(a^b)) is implemented as
(!(!(a&B)&!(A&b)), !(!(a&b)&!(A&B))). This
means XOR is quite a bit more competitive than I thought it was at
achieving avalanche. My surveys so far find nothing 4-deep or 5-deep
that approaches the ability of XOR to achieve avalanche. An XOR of 4
variables is 5 deep and has an avalanche of 4. The best 5-deep
non-XOR I've found so far uses 5 variables and has avalanche of 3.5,
10100011010111000101110010100011. It's clearly a variant on XOR. XOR
increases avalanche 1.4142x
per level. Random trees with the golden ratio of bits set (which causes
!(a&b) to also have the golden ratio of bits set) increases
avalanche by about 1.236x per level, well short of XOR. I'll search
5-deep balanced functions a bit more, but at the moment I believe
that Feistel networks are about the optimal way to achieve avalanche
out of all possible NAND trees.
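The dual-rail XOR construction above can be checked exhaustively; here's a minimal sketch:

```c
#include <assert.h>

/* 2-input NAND restricted to one bit. */
static unsigned nand(unsigned x, unsigned y) { return ~(x & y) & 1u; }

/* Dual-rail XOR: given (a,!a) and (b,!b), produce (a^b, !(a^b))
   using only two levels of 2-input NANDs, as described above. */
static void xor_pair(unsigned a, unsigned A, unsigned b, unsigned B,
                     unsigned *s, unsigned *S)
{
    *s = nand(nand(a, B), nand(A, b)); /*   a^b  */
    *S = nand(nand(a, b), nand(A, B)); /* !(a^b) */
}
```

Since each XOR stage costs 2 levels once the initial complements exist, n chained stages cost 2n+1 levels in total.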

**(March 27 2008)** I realized that hillclimbing to find an initial 16-bit
permutation (which takes days) can be skipped by instead initializing
the hillclimber with two 8-bit permutations (which takes a few seconds).
Here's a 5-deep 16-bit permutation where every balanced function has
avalanche between 3.25 and 4.0. (There are no 6+ variable balanced
functions that are 5-deep with avalanche greater than 3.125.) I'm not
claiming it's a particularly good permutation, it's just *a* permutation.

```
1, 3, 4, 5, 6  10111000010001110100011110111000  (((1 (1 2)) ((0 0) (1 1))) ((3 4) ((3 3) (4 4))))  (((0 (0 1)) (1 2)) ((3 (3 4)) (4 (3 3))))
1, 3, 4, 5, 6  10001011011101000111010010001011  (((0 (0 1)) (1 (1 2))) ((3 (3 4)) (4 (3 3))))  (((1 2) ((0 0) (1 1))) ((3 4) ((3) (4 4))))
0, 2, 3, 7, 9  11100100000110110001101111100100  (((0 1) (2 (0 0))) ((3 (3 4)) (4 (3 3))))  (((0 (0 1)) ((0 0) (2 2))) ((3 4) ((3 3) (4 4))))
0, 3, 5, 7     1001011001101001  (((0 1) ((0 0) (1 1))) ((2 3) ((2 2) (3 3))))  (((0 (1 1)) ((0 0) 1)) ((2 (3 3)) ((2 2) 3)))
3, 4, 5, 6,13  10010110100101100110101101100001  (((0 (1 4)) (2 (3 4))) ((1 (0 2)) (4 (0 2))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
2, 5,10,12,13  10100101010110101001011010010110  (((0 2) ((0 0) (2 2))) ((4 (1 1)) ((1 4) (3 3))))  (((0 (0 2)) (2 (0 0))) ((1 4) (3 (3 4))))
0, 4, 5, 6     1001011001101001  (((0 1) ((0 0) (1 1))) ((2 3) ((2 2) (3 3))))  (((0 (1 1)) ((0 0) 1)) ((2 (3 3)) ((2 2) 3)))
0, 3, 4, 6     1001011001101001  (((0 1) ((0 0) (1 1))) ((2 3) ((2 2) (3 3))))  (((0 (1 1)) ((0 0) 1)) ((2 (3 3)) ((2 2) 3)))
1, 3, 5, 6, 7  10101001010101100110010110011010  (((0 3) ((0 0) (3 3))) ((1 (1 2)) ((1 1) (4 4))))  (((0 (0 3)) (3 (0 0))) ((1 2) (4 (1 1))))
2, 4, 8, 9,12  10100101010110101001011010010110  (((0 2) ((0 0) (2 2))) ((4 (1 1)) ((1 4) (3 3))))  (((0 (0 2)) (2 (0 0))) ((1 4) (3 (3 4))))
4, 6,11,12,15  10011001011001100110001011011001  (((0 (1 4)) (3 (1 4))) ((1 (0 3)) (4 (0 2))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
0, 4, 7,12,15  10010110110111100010000101101001  (((0 (0 2)) (2 (0 0))) ((1 (1 4)) (3 (1 1))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
0, 4,11,12,14  10010010011000011101011011101001  (((0 3) ((0 0) (3 3))) ((1 2) ((1 1) (2 2))))  (((0 (0 3)) (3 (0 0))) ((1 (1 2)) (4 (1 1))))
1, 4, 5, 6,13  00011010001001011011010101111010  (((0 (0 3)) (3 (0 0))) ((1 (1 2)) (2 4)))  (((0 2) ((0 0) (2 2))) ((1 (1 3)) (3 (1 1))))
8, 9,11,13,14  00111100001011011100001111010010  (((0 3) ((0 2) (1 1))) ((2 (2 4)) (4 (2 2))))  (((1 (0 3)) (2 (2 4))) ((2 4) ((1 2) (1 4))))
2, 4, 6, 7,10  10010110011011011001011001100001  (((0 (1 3)) (2 (1 3))) ((1 (0 2)) (3 (2 4))))  (((0 2) ((0 0) (2 2))) ((1 3) ((1 1) (3 3))))
```

**(April 9 2008)** Here's a permutation that can claim that the difference
between its mapping for 0 and its mapping for a value with 1 bit set will
always have at least 2 bits different. That's something I couldn't
accomplish with LFSRs and Feistel networks. NAND depth 5.

```
5  4, 9,10,12,14  10010110011110011000011001101001
5  3, 4, 5, 8,10  01111001100101110010100110010010
5  0, 1, 3, 6,13  10100101010110101001011010010110
5  1, 7,10,14,15  10010110100101000110101101101001
5  3, 7,10,11,15  10010110011110011001011000101001
5  2,10,12,14,15  10010101011010101010011001011001
5  0, 1, 6, 9,11  10011001111101100110011010010000
5  1, 2, 3, 6,13  00111100110000111001011010010110
5  0, 1, 3, 9,13  10011001011001100111001010110001
5  2, 3, 4, 7,10  10010110100100100110110101101001
5  0, 6, 9,11,13  01110110110110011001100001100010
5  5, 8,11,14,15  00111100110000111010001100111010
5  0, 6, 9,11,13  11010001011101000011110011000011
5  0, 6, 9,11,13  10010100011000011001011101101101
5  0, 2, 6, 9,11  00011100110100111100000100111101
5  0, 2, 3, 5, 8  10111000010001110100011110111000
```

**(April 10 2008)** To verify these 16-bit permutations are
permutations, I'm checking
all 2^{16} values. Suppose I wanted to build a 512-bit
permutation; it wouldn't be feasible to check all 2^{512}
values. But suppose each output bit depends
on at most 5 input bits. Does that help? I think so. If I want to
change the function f(a,b,c,d,e) for the last bit to g(a,b,c,d,e), I
just have to find all output bits that have a,b,c,d,e as their inputs
and evaluate them. So I map a bit to its inputs, back to all their
outputs, then back to *their* inputs, and I go through all values
for all those input bits. I think if the result is balanced then it's
still a permutation with g(a,b,c,d,e) replacing f(a,b,c,d,e). I
think that shows a,b,c,d,e are still happy, and it can't break
anything other than a,b,c,d,e.
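Checking all 2^{16} values is simple enough to sketch. Here the candidate permutation is given as a plain lookup table, which is my own simplification, not nandperm.c's representation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Verify that map[] (a table of 2^16 outputs, one per input) is a
   permutation by marking each output in a bitmap; it is a permutation
   iff no output value appears twice. */
static int is_permutation16(const uint16_t *map)
{
    static uint8_t seen[1 << 16];
    memset(seen, 0, sizeof seen);
    for (uint32_t v = 0; v < (1u << 16); v++) {
        if (seen[map[v]]) return 0; /* duplicate output */
        seen[map[v]] = 1;
    }
    return 1; /* 2^16 distinct outputs over 2^16 inputs */
}
```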

If that selects fewer than all
input bits, that would be faster than evaluating all input bits.
Doing three mappings like that could depend on up to 89 bits if both
directions had a branching of 5. But inputs to outputs have a
variable branching: the average is 5, but some inputs are used by many
more than 5 outputs. You could easily get unlucky and need to
evaluate all possible settings of over a hundred bits. That's a
large theoretical improvement over 2^{512} values, but even
2^{89} isn't feasible in practice. So, that optimization
looks like a dead end.

Perhaps if I calculate the inverse of the permutation, then I'd only have to do two mappings, from the output to be changed back to the inputs back to the outputs? That's up to 21 output bits with a constant branching of 5, which is close to feasible. So far I have no idea how to make that work. I suspect it can't be made to work.

**(April 11 2008)** fivebalanced.txt is
a list of a few thousand of the best-mixing 5-deep balanced functions.

**(May 3 2008)** Some of the better-mixing 16-bit permutations I've
seen are actually two 8-bit permutations followed by a shuffle of all
16 bits. My framework has 64 of the 16-bit chunks to permute, which
maps nicely to a 64-bit integer for each of the 16 bits. If I limit
the hash to a single 8-bit permutation repeated twice, then I can
compute that 16-bit permutation using SSE2 instructions, 128 bits at a
time. I'd still have to split it into 16 64-bit integers to shuffle
the bits and do the per-bit rotates.

**(July 22, 2008)** I'm using tree2c.c to
convert 8-bit permutations to C code. They vary from 65 to 130
& | ^ ~ operations. For example

```c
u4 truth1[] = {
    0xad54659a, 0x00006996, 0x9648b769, 0x00009669,
    0x987662d9, 0x9696eb28, 0xd54aa659, 0x00006996,
};
u4 mask1[] = { 0x0d6, 0x02b, 0x0d3, 0x03c, 0x0ab, 0x0d6, 0x0d6, 0x02d,};

void x1(u8 *v)
{
    u8 a0 = v[0];  u8 a1 = v[1];  u8 a2 = v[2];  u8 a3 = v[3];
    u8 a4 = v[4];  u8 a5 = v[5];  u8 a6 = v[6];  u8 a7 = v[7];
    u8 a8 = a1 ^ a6;    u8 a9 = ~a7;        u8 a10 = a2 & a9;   u8 a11 = a1 & a7;
    u8 a12 = a11 | a4;  u8 a13 = ~a12;      u8 a14 = a10 | a13; u8 a15 = a8 & a14;
    u8 a16 = a1 & a6;   u8 a17 = a4 & a6;   u8 a18 = a1 | a17;  u8 a19 = ~a18;
    u8 a20 = a16 | a19; u8 a21 = a2 & a7;   u8 a22 = ~a2;       u8 a23 = a4 & a22;
    u8 a24 = a21 | a23; u8 a25 = a20 & a24; u8 a26 = a15 | a25; u8 a27 = a0 ^ a5;
    u8 a28 = a1 ^ a3;   u8 a29 = a27 ^ a28; u8 a30 = a0 ^ a4;   u8 a31 = a30 & a8;
    u8 a32 = a1 | a7;   u8 a33 = ~a16;      u8 a34 = a32 & a33; u8 a35 = a30 | a34;
    u8 a36 = ~a35;      u8 a37 = a31 | a36; u8 a38 = a2 ^ a5;   u8 a39 = a3 ^ a4;
    u8 a40 = a38 ^ a39; u8 a41 = ~a40;      u8 a42 = ~a5;       u8 a43 = a0 & a42;
    u8 a44 = ~a0;       u8 a45 = a3 & a44;  u8 a46 = a43 | a45; u8 a47 = a1 ^ a7;
    u8 a48 = a46 & a47; u8 a49 = a27 | a47; u8 a50 = ~a49;      u8 a51 = a48 | a50;
    u8 a52 = a2 ^ a4;   u8 a53 = a47 & a52; u8 a54 = a6 & a9;   u8 a55 = a11 | a54;
    u8 a56 = ~a52;      u8 a57 = a55 & a56; u8 a58 = a53 | a57; u8 a59 = ~a4;
    u8 a60 = a2 & a59;  u8 a61 = a1 & a4;   u8 a62 = ~a61;      u8 a63 = a7 & a62;
    u8 a64 = a60 | a63; u8 a65 = a8 & a64;  u8 a66 = a2 | a7;   u8 a67 = a2 & a4;
    u8 a68 = ~a67;      u8 a69 = a66 & a68; u8 a70 = a8 | a69;  u8 a71 = ~a70;
    u8 a72 = a65 | a71; u8 a73 = a2 ^ a3;   u8 a74 = a27 ^ a73;
    v[0] = a26; v[1] = a29; v[2] = a37; v[3] = a41;
    v[4] = a51; v[5] = a58; v[6] = a72; v[7] = a74;
}
```

**(July 24, 2008)** I spent a day scanning papers looking for the
standard way to associate a key with a permutation to form a block
cipher. Near as I can tell, there isn't any. k0^f(k1^block) was
proposed for DESX, but that requires twice the ideal amount of key,
and doesn't play well with real f() which do some XORing of their
own.

Should I trickle in data, or XOR 1024-bit blocks all at once?
Trickling has the advantage that there are parts of the input you
can't overwrite. Big blocks have the advantage that you can do more
work between writes. Attacks often involve, oh, 8 bit deltas. If you
only overwrite half the block, there's a 2^{-8}=1/256 chance
that an interesting 8-bit delta will be entirely in the half you're
overwriting. The security gain from doubling the number of rounds is
much more than a factor of 256. So big blocks win.

I'm building a hash. It doesn't have to permute the input blocks. I could do m0 f k f m0 f m1 f k f m1 f m2 f m2. It loses about a bit of entropy from m0, but that's OK, I've got a 1024-bit state and I only report at most 512 bits of it. I could do m1 f k f m0 f m2 f k f m1 f m3 f k f m2 f ..., which puts 2x the work between the two instances of a block instead of f.

If there's an attack involving m1 and m2, then m1 and m2 get repeated twice here, it's symmetric, and the same attack might hold. So I could do m1 f k f m0' f m2 f k f m1' f m3 f k f m2' f ..., where x' is x with the top and bottom half swapped. Since the top and bottom half are hashed entirely differently (they feed into different bits of the 16-bit permutation), any common useful delta between m1,m2 and m1',m2' will be coincidence.

If I've got avalanche in n rounds, I reckon I need about 4n rounds for cryptographic strength. I don't have to do m1 m0' m2 m1' m3 m2', I could do m3 m0' m4 m1' m5 m2' m6 m3', or whatever I need to put the two m1's 4n rounds apart. I think that leaves a flaw in (m2,m3), and an unrelated flaw in (m2',m3'), as the worst attack. I think that has about the same probability as a flaw in (m2,m3) with twice as many rounds. That suggests m2,m3 should be 2n rounds apart, yielding m2 f k f m0' f m3 f k f m1' f m4 f m2' f, where f is 5 rounds.

A round is 5 nands deep, an XOR is 3 nands, and 3 is just barely negligible compared to 5*5, so this looks about right. That's 84 gates deep per 1024 bits, hey, not bad. 69 gates if I can do avalanche in 6 rounds. 4*1024 bits = 512 bytes of storage.

**(July 24 2008)** Let's take that a little more to the extreme.

```
 f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f  f
 k  k  k  k  k  k  k  k  k  k  k  k  k  k  k  k  k  k
 1  2  1  2  1  2 13 14 13 14 13 14 13 14 13 14 13 14
 3  4  3  4  3  4  3  4 15 16 15 16 15 16 15 16 15 16
 5  6  5  6  5  6  5  6  5  6 17 18 17 18 17 18 17 18
 7  8  7  8  7  8  7  8  7  8  7  8 19 20 19 20 19 20
-3 -2  9 10  9 10  9 10  9 10  9 10  9 10 21 22 21 22
-1  0 -1  0 11 12 11 12 11 12 11 12 11 12 11 12 23 24
```

I'm back to testing whole permutations. Here's a random one, nisthash5.c, similar to nisthash[1..4].c. At the moment it looks like I need 7 or 8 f's for avalanche, not 6. So far these nand-trees are looking worse than the old Feistel networks.

**(July 25 2008)** I switched from combining every block 6 times to
combining every block 10 times, and using 3 rounds between combines
instead of 5. I've got a complete hash now,
nisthash6.c.
It can, like, hash strings and everything. This runs in 12 cycles per
byte in visual studio C++ on an Intel quad core (about 4 for
combining blocks and 8 for mixing).

**(July 26 2008)** I optimized that a bit this morning (it started at
22 cycles per byte, then 18, then 13, got it to 12 this morning and
updated nisthash6.c). Next up is looking for characteristics? No, not
yet. Next up is looking into SSE and what is required to use it. I
see it requires 16 byte alignment, and likes to load and store 128
consecutive bits. I've got to rearrange my two 8-bit permutations so
that the 64 low-0 bits and 64 high-0 bits are consecutive, so SSE can
do all 128 permutations at once. I may be able to use SSE in
combining the data as well, but that may require 8 distinct shift
amounts instead of 10 (is that OK?).

The key and message blocks are all 16 64-bit values. If you XOR a 64-bit constant Q to all of them, and there's exactly two message blocks, you'll get the same hash because the key and message blocks are XORed (rotated by multiples of 64 bits) and the Q will cancel out. Easily solved: make the key 512 bits instead of 1024, and treat the high 512 bits of the key as zero.

Related. Suppose you have blocks A,B,C,D and you xor Q to all their 64-bit quantities. You'll get the first (A,B) gap and the final (C,D) gap, but in between A^C and B^D will cancel out Q. That's a total of 6 rounds to find differentials for (well short of 30). The root of this problem is that linearity in the combine allows differentials to be shuffled among blocks, probably using f(B) as well as B. That f counts towards the 30 rounds, but only once, no matter how many times f(B) is used.

**(July 27 2008)** OK, the game with block scheduling is for any set
of blocks, you draw line segments connecting combines that use those
blocks. Segments are allowed to pass through combines with one or
more block. A combine with two or more blocks can be treated as a
line segment of zero length. This tells me how many rounds of f need
differentials found for them. The first and last block will always
contribute 1 gap each, and using f(B) as well as B adds the equivalent
of 1 gap every 3 blocks, so only (10-2)*3=24 blocks at most can
be a threat. I need to confirm every subset of 24 or fewer blocks
requires at least 10-#blocks/3 line segments.

According to spacing.c, which brute-force tries all sets of up to 8 blocks, I can accomplish that by combining block i at offsets i, i+12-5(i%3), i+28-5((i+2)%3), i+31. I'll try to brute-force 9 blocks as well, but I don't have the resources to do more than that. Can I skip f(B)? No, that's what makes this problem finite. Otherwise you apply linear algebra and you'll find some very complicated way to have only 6 rounds to deal with.
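The schedule those offsets produce is easy to restate as code (just the formula above, not spacing.c itself):

```c
#include <assert.h>

/* The four combine offsets for block i, per the spacing.c result
   described above: i, i+12-5*(i%3), i+28-5*((i+2)%3), i+31. */
static void combine_offsets(int i, int off[4])
{
    off[0] = i;
    off[1] = i + 12 - 5 * (i % 3);
    off[2] = i + 28 - 5 * ((i + 2) % 3);
    off[3] = i + 31;
}
```

For example, block 0 is combined at offsets 0, 12, 18, and 31; the four offsets are strictly increasing for every block.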

Tracking 32+2 128-byte blocks uses just over 4K of RAM. Using f(B) as well as B requires an extra f() per block, making it 4 instead of 3 applications of f(). That will slow down the hash some in software, and require a duplicate implementation of f() in hardware. On the other hand, I'm only using each block 4x, which reduces the combining work, and eliminates my worry about combining blocks 10x but only having 8 128-bit rotates that can be done with SSE.

I've revised the hash to handle all that, nisthash7.c, but it's still not using SSE. 22.7 cycles/byte with 32-bit gcc, 11.3 cycles/byte on 64-bit Vista with visual studio. (Whoa, it got faster, perhaps I did something stupid and it's optimizing away all the combines?).

**(July 31 2008)** I tried using SSE for the mix and the
combine. That brings it down to 7.6 cycles/byte in Visual Studio 2008
on my Vista quad core (nisthash82.c), and
10.9 cycles per byte with gcc on my 32-bit dual core XP machine (nisthash8.c). Unfortunately nisthash8.c gives
different results via gcc and visual studio, and nisthash82.c only
compiles under visual studio (and gives a third set of results). I'm
still not doing anything but plain C for the 64-bit rotates. I think
I should look at correctness next.

**(August 2 2008)** After fixing some bugs, nisthash9.c is back up to 7.7
cycles/byte under visual studio on my Vista quad core, and 8.4
cycles/byte under gcc on my XP dual core. I switched the permutation
because the previous 8-bit permutation had a deterministic delta. I
switched the key schedule too, I had a bug in spacing.c. It gives
the same results on both platforms now. Now I'm using i, i+23-6*(i%4),
i+26-6*((i+2)%4), i+31, which requires at least 10 gaps for up to 9
variables.

**(August 4 2008)** nisthash10.c. I
switched the permutation and 16-bit shuffle and shifts again. 7.6
cycles/byte visual studio, 7.9 cycles/byte gcc. The 16-bit mix 3x gives
0.375 avalanche. Due to XOR, there are fixed points for individual
16-bit chunks, but each chunk has different ones. At the 1024-bit
level, every bit is affected by 8 rounds forward or 7 rounds backwards
(0.27, 0.29), with nearly-zero and nearly-one base states, and 7 rounds
forward or 6 rounds backwards (0.378, 0.17) with random base states.

**(August 6 2008)** nisthash11.c.
I'm thinking about what happens if you have a message that is just
block B repeated over and over. After 31 combines, you'll be combining
B^B^f(B)^f(B)^key every round, which is just the key. With a zero key
that's no combine at all. If you change the message by adding more
copies of B, those intermediate states are just the effect of 3 rounds
on the internal state. But that's after 31 combines, which is 93
rounds. If you try to change the key to control this, that's 93
rounds where the key will be scrambling the internal state, so I don't
see how to exploit this. Still, I think I'll change it to
B^f(B)^f(B)^f(B)=B^f(B) instead of B^B^f(B)^f(B)=0. Those instances
of the blocks were actually rotated by different amounts, but still.

**(August 8 2008)** nisthash12.c.
Fixed bugs in the block scheduling. Shifted length by 3 when combining
so it can handle a length in bits.

I ran my characteristic-finding code for one, two, three, four, five iterations of perm(), and it didn't find anything. I only did 50000 trials with 10000 tests of each possible characteristic, and the code is easily confused anyhow. I looked for deterministic deltas for one round. There aren't any, other than the zero delta. I tried all two-bit deltas with random bases, they do avalanche in 7 rounds forward and 6 rounds backward too.

Assuming the block scheduling works so deltas must make it
through 30 rounds, then a delta with probability 3/4 per round
(there are such deltas for single rounds, but they can't be chained)
yields a 30-round
output delta of 2^{-12} probability. If I could show no chain
of deltas gives more than 1/2 probability per round, that's
2^{-30} probability. I need to show no chain of deltas
gives 2^{-17} probability per round to reach 2^{-512}
probability for the full 30 rounds. It seems unlikely that I could
show that. I don't think I can enumerate all deltas with
2^{-17} probability for a single round. But if the average
input delta in a chunk has no output delta with
probability greater than 1/2, that's 2^{-64} per round for the
whole state, so 8 rounds would have no output delta with
probability 2^{-512}, so 30 rounds would be easily enough for
non-special input deltas. How likely is the most likely output
delta per chunk per input delta?

**August 9 2008** Since the 16-bit chunks are actually two copies
of an 8-bit permutation, with 3 bits possibly flipped, it's the
behavior of those 256 deltas for those 2^{3} 8-bit
permutations that matters. Everything else is combinations of those.
Only zero-deltas are deterministic; next come 24 deltas of probability
.75 (192/256), then 16 of 176/256, 8 of 144/256, and 32 of 128/256. The strongest
input delta you can get to chain 3 rounds looks about 1/3 probability
per round. The average output delta per 8-bit permutation has
probability about 36/256, so with 128 chunks per round the average
output delta after one round has probability about (36/256)^{128},
roughly 2^{-362}.

Perhaps I should check the configuration to guarantee 64 rounds
min, which would require me to verify no 2^{-8} per round
delta chains, which might be feasible to verify. That'd be kind of
sad, making it bigger and slower if it didn't really need it. I
had spacing.c search today for a way to guarantee 16 gaps (with 4
rounds each) but didn't find one that used under 70 blocks.

**August 10 2008** Enough, time to write this up. Me being me,
one of the biggest risks is I'll be disqualified for not following the
submission directions properly. I can't follow directions. So I need
to submit it by Aug 31 so they'll tell me if I didn't follow the
directions right.
nisthash13.c: I changed it to B^B^B^f(B)
instead of B^f(B)^f(B)^f(B). That way, if you had random access to
the key and a 4K sliding window of the data, you only need 256 bytes
or so on chip at a time, and you only have to compute f(B) once
per block.

**August 12 2008** nisthash14.c.
I initialize the accumulators with the key, and at the end XOR them
with the length (the length goes into the a[i][i%LEN] for the ith
accumulator), and I omit combining any fake leading and trailing
blocks. This avoids all combines and f(B) calculations when finishing
up the hash. It allows streaming inputs where the length is not known
until the end of the input is seen. The offsets for the length
prevent a length extension attack where you use len+1, key-1. Hashing
a null string takes 18400 cycles, which is much slower than encrypting
the null string. Hashing a 1024 byte buffer takes 25400 cycles. I
also made the final hash the whole 1024-bit state, I can't prove XORing
the high and low halves is necessary.

**August 13 2008** Rather than XORing a bunch of blocks together,
applying a round to one of the uses of each block, and XORing that
to the state every three rounds, how about doing the XOR thing, then
going two rounds, then XORing just one block to the state, then doing
another round? That way I'd do 3 rounds of work per block rather than
4. Great! I'm searching for a block schedule that passes that. Also,
my billion-byte timings are actually 2^{14} timings of hashing
2^{16} bytes, not a single hash of 2^{30} bytes. Gotta fix that.

**August 14 2008** nisthash15.c
I changed the block scheduling to 0(+1 rounds), 26, 28, 38. That means
3 rounds of work per block instead of 4 rounds of work per block,
bringing the time down to about 5.8 cycles/byte. I changed the use of
the key and length so that the first block is the key and the last block
is the key XORed with the length. This makes hashing an empty string
even slower: 120 rounds instead of the previous 93.

**August 17 2008** nisthash16.c
Bah, more bugs in block scheduling. I replaced
spacing.c with a version with the known
bugs fixed. Now it keeps 47 128-byte accumulators around. I split the
hash into init(), add(), final(), so now I can do a true 1-billion-byte
hash by feeding a single 64K buffer 2^{14} times. 5.1 cycles/byte,
23661 cycles to hash an empty string. A little slower for unaligned
inputs. The internal state is 6672 bytes.

**August 19 2008** nisthash17.c
I revised the key and length scheduling. Now the key is variable length,
but a multiple of 16 bytes. The key is hashed, then the data is hashed
(zero-padded to a byte boundary), then the key is hashed again, then
2 bytes encoding 16*keylen+2*padding+1 are added (both lengths measured
in bits), then the final block is zero-padded and hashed.
reference.c,
reference.h also implement it using
the NIST APIs. They're slower, because they don't use SSE and don't
unroll loops to precompute constants, and slower yet on 32-bit platforms
where the compiler is implementing 8-byte ints with 4-byte operations.
NIST required the length to be given in bits. Also
optimized.c,
optimized.h which implement the NIST APIs and
are the same speed as nisthash17.c. They all compute the same hash.
5.3 cycles/byte, 23077 cycles to hash the empty string.

**August 22 2008** The submission is ready.
The hash will be named Maraca. I'll mail it off tomorrow morning.
Here's the
algorithm specification.

**October 5 2008** NIST responded with a bunch of deficiencies.
Most of them are checkbox sorts of things, but some are tricky. The
worst is insufficient explanation of constants (i.e. you can't
deterministically derive the constants), which I can't fix. It's a
big space. My exploration of it depended on several months of
tweaking programs, on exactly when blackouts stopped search programs,
on bugs in helper programs, and so forth. Another serious deficiency
was lack of cryptanalysis. I knew about that one. Again, it's a big
space; I can't prove much of anything about what can't be done. But I
can do a trial cryptanalysis, so I'm doing that.

In the process I noticed my shift constants don't obey the rules I said they did. I said they were i+8*j, where all i in 0..7 and j in 0..7 are covered twice. That wasn't true, and there were two duplicate shift amounts (and a shift by zero) roughly because of that. So I'm replacing the shift amounts, which changes the hash function. I'll have to redo the tests and submit the new function.

**October 21 2008**
One of the deficiencies cited was that I hadn't addressed smaller
message digests directly. Looking at that, I realized that since the
last use of the last block isn't mixed at all, a delta in only the
high bits won't affect a truncated result. Reanalyzing the pattern
with the last use of the last block omitted, it guarantees only 24
permutations. This can be hit by deltas in blocks 2, 3, 7. Worse, every
block use in the last 30 permutations is suspect of not fully
affecting the result. I'll fix that by doing 30 permutations of
strengthening before reporting the result. It's tempting to XOR the
key at the end, too, so those 30 permutations cannot be rolled back.
But that would require analyzing the key differently than any other
data, so I didn't do it. Other ciphers appear to have some final
strengthening as well, so this hack is in good company.

I've found replacement constants for map[] and shift[]. They obey the rules and they give better measures than the old constants.

**October 22 2008** I'm resubmitting it: nist.

This is how to use a permutation to build a hash function.
Each formula requires **p** permutations of work per data block,
guarantees that every delta must pass through at least **d**
permutations, requires **b** data blocks to be kept in memory at once,
and uses each block **u** times.
In all cases, the +1 on everything but the
first use guarantees that a delta involving q blocks will require
at least q permutations, so the guarantee is really max(d,q).
Replacing each permutation with n permutations multiplies the guarantee by n.
There may be some other guarantee that would be more useful than these,
but by doing some overkill with these, any other useful
guarantee could be met as well. Or similar searches could come up with
patterns that satisfy other guarantees directly.
Which guarantee is useful depends on the use and the permutation. Most
of these I searched up to 10 deltas out of 40 blocks, and up to 7 out
of 80 blocks. The formulas may fail beyond that.

d | p | u | b | formula | date |
---|---|---|---|---|---|
1 | 1 | 1 | 1 | i | Sept 4 2008 |
1 | 2 | 2 | 1 | 2i, 2i+1 | Sept 3 2008 |
3 | 2 | 2 | 2 | 2i, 2(i+1)+1 | Sept 3 2008 |
4 | 2 | 2 | 3 | 2i, 2(i+2)+1 | Sept 3 2008 |
5 | 2 | 2 | 7 | 2i, 2(i+6-4(i%2))+1 | Sept 3 2008 |
6 | 2 | 2 | 12 | 2i, 2(i+11-4(i%2))+1 | Sept 3 2008 |
6 | 2 | 3 | 6 | 2i, 2(i+4-4(i%2))+1, 2(i+6)+1 | Sept 3 2008 |
7 | 2 | 2 | 20 | 2i, 2(i+19-4(i%4))+1 | Sept 3 2008 |
7 | 2 | 3 | 8 | 2i, 2(i+5-4(i%2))+1, 2(i+7)+1 | Sept 3 2008 |
8 | 2 | 2 | 27 | 2i, 2(i+26-4(i%4))+1 | Sept 3 2008 |
8 | 2 | 3 | 9 | 2i, 2(i+6-4(i%2))+1, 2(i+8)+1 | Sept 3 2008 |
9 | 2 | 3 | 10 | 2i, 2(i+5-4(i%2))+1, 2(i+9)+1 | Sept 3 2008 |
10 | 2 | 3 | 11 | 2i, 2(i+6-4(i%2))+1, 2(i+10)+1 | Sept 3 2008 |
11 | 2 | 3 | 20 | 2i, 2(i+2+4(i%4))+1, 2(i+19)+1 | Sept 3 2008 |
12 | 2 | 3 | 21 | 2i, 2(i+4+4(i%4))+1, 2(i+20)+1 | Sept 3 2008 |
13 | 2 | 3 | 28 | 2i, 2(i+14-4(i%4))+1, 2(i+27-4(i%2))+1 | Sept 3 2008 |
14 | 2 | 3 | 33 | 2i, 2(i+14-4(i%4))+1, 2(i+32-4((i+2)%4))+1 | Sept 3 2008 |
15 | 2 | 4 | 21 | 2i, 2(i+3)+1, 2(i+13-8(i%2))+1, 2(i+20)+1 | Sept 6 2008 |
15 | 2 | 4 | 26 | 2i, 2(i+22-6(i%4))+1, 2(i+23-6((i+2)%4))+1, 2(i+25)+1 | Aug 19 2008 |
16 | 2 | 3 | 40 | 2i, 2(i+14-4(i%4))+1, 2(i+39-8((i+1)%2))+1 | Sept 6 2008 |
16 | 2 | 4 | 24 | 2i, 2(i+14-8(i%2))+1, 2(i+4)+1, 2(i+23)+1 | Sept 6 2008 |
17 | 2 | 4 | 29 | 2i, 2(i+14-4(3i%4))+1, 2(i+22-8(i%2))+1, 2(i+28)+1 | Sept 6 2008 |
18 | 2 | 4 | 32 | 2i, 2(i+10-8(i%2))+1, 2(i+27-4(3i%4))+1, 2(i+31)+1 | Sept 6 2008 |
19 | 2 | 4 | 34 | 2i, 2(i+11-8(i%2))+1, 2(i+27-4(i%4))+1, 2(i+33)+1 | Sept 6 2008 |
20 | 2 | 4 | 37 | 2i, 2(i+17-4(3i%4))+1, 2(i+32-4(i%4))+1, 2(i+36)+1 | Sept 11 2008 |
20 | 2 | 4 | 39 | 2i, 2(i+21-6(3i%4))+1, 2(i+34-4((i+1)%4))+1, 2(i+38)+1 | Sept 11 2008 |
21 | 2 | 4 | 42 | 2i, 2(i+11-8(i%2))+1, 2(i+35-4(3i%4))+1, 2(i+41)+1 | Sept 11 2008 |
21 | 2 | 4 | 43 | 2i, 2(i+19-12(i%2))+1, 2(i+38-4((3i+1)%4))+1, 2(i+42)+1 | Sept 11 2008 |
22 | 2 | 4 | 46 | 2i, 2(i+11-8(i%2))+1, 2(i+29-4(3i%4))+1, 2(i+45)+1 | Sept 11 2008 |
23 | 2 | 4 | 48 | 2i, 2(i+13-8(i%2))+1, 2(i+39-4(i%4))+1, 2(i+47)+1 | Sept 11 2008 |
23 | 2 | 4 | 49 | 2i, 2(i+19-12(i%2))+1, 2(i+45-4((3i+2)%4))+1, 2(i+48)+1 | Sept 11 2008 |
24 | 2 | 4 | 53 | 2i, 2(i+19-12(i%2))+1, 2(i+45-4((i+1)%4))+1, 2(i+52)+1 | Sept 11 2008 |
24 | 2 | 4 | 53 | 2i, 2(i+16-4(3i%4))+1, 2(i+41-8(i%2))+1, 2(i+52)+1 | Sept 11 2008 |
25 | 2 | 4 | 61 | 2i, 2(i+22-6(i%4))+1, 2(i+48-8(i%2))+1, 2(i+61)+1 | Sept 20 2008 |
26 | 2 | 4 | 62 | 2i, 2(i+22-6(i%4))+1, 2(i+49-8(i%2))+1, 2(i+62)+1 | Sept 20 2008 |
27 | 2 | 4 | 63 | 2i, 2(i+20-16(i%2))+1, 2(i+51-6(3i%4))+1, 2(i+63)+1 | Sept 20 2008 |
30 | 3 | 4 | 47 | 3i, 3(i+21-6(i%4))+1, 3(i+41-6((i+2)%4))+1, 3(i+46)+1 | Aug 19 2008 |
