If you just want a good generator, here's one that looks OK to me. It's public domain, so you can do whatever you want with it.
typedef unsigned long int u4;
typedef struct ranctx { u4 a; u4 b; u4 c; u4 d; } ranctx;

#define rot(x,k) (((x)<<(k))|((x)>>(32-(k))))

static u4 ranval( ranctx *x ) {
    u4 e = x->a - rot(x->b, 27);
    x->a = x->b ^ rot(x->c, 17);
    x->b = x->c + x->d;
    x->c = x->d + e;
    x->d = e + x->a;
    return x->d;
}

static void raninit( ranctx *x, u4 seed ) {
    u4 i;
    x->a = 0xf1ea5eed;
    x->b = x->c = x->d = seed;
    for (i=0; i<20; ++i) {
        (void)ranval(x);
    }
}
Show of hands: Who's used a pseudorandom number generator? Who's written one? Who's tested one?

Pseudorandom: given a large chunk of a stream of bits, except for one missing bit, it's 50/50 whether the missing bit is a 1 or a 0. If you can make a correct guess with more than 50% probability, that's a bias. The bias is how much more than 50%. A random choice is a choice made with equal probability from all possibilities, and a pseudorandom choice is a deterministic choice that pretends to be random. All bits in a pseudorandom stream appear to be independent random choices.
A test tries to distinguish a pseudorandom stream from a truly random one. You can often use math to figure out stuff about a truly random stream. For example, if you've seen n bits, how many 0s should you have seen? About n/2.
Central Limit Theorem: If you count a bunch of things, your total will follow a normal curve, with a mean of n and a standard deviation of sqrt(n). 99.7% of the time the result will be within n ± 3sqrt(n). Got that? Memorize that. A total of n means a random fluctuation of sqrt(n) on average, usually less than 3sqrt(n). Central Limit Theorem. Or the Law of Large Numbers.
Noncryptographic: You're limited to reasonable approaches to testing. Things that real applications might care about. Mostly this means not exploiting the algorithm details when writing the test.
What's a pseudorandom number generator look like?
The internal state has to be mixed or postprocessed enough that all the bits in the results look independent from one another and uniformly distributed.
How's this different from a block cipher or hash function?
Pictures:
A block cipher is a mixing function that takes a key and a message, and is reversible for that message. In a block cipher, the user has full control over the input. If there are any patterns preserved, the user can tell. You've achieved avalanche if every output bit changes between 1/4 and 3/4 of the time, but that still allows many patterns. Cryptographic block ciphers have rounds that repeat a mixing function; they usually repeat it 3 to 12 times more than they need to achieve avalanche.
There's not much difference between a block cipher and a random number generator producing its first result. Or a hash function and its last input. See, the pictures are the same. I've tended to view the initialization of random number generators as Someone Else's Problem for this reason: you should use a block cipher to set up the initial internal state.
You can build a random number generator by repeatedly applying a block cipher to its own output. This is all mixing. The output of one round is the input to the next, so the user knows both the input and output. All the mixing is done in that one mix step.
You can build a random number generator by applying a block cipher to a counting sequence. This is all postprocessing. The internal state hardly gets mixed at all. You could count by 0xdeadbeef instead of by 1, that helps a little. The user never sees the internal state, so the path to do mixing (from the user's perspective) is twice as long. But the user has a zillion such paths to do correlations over.
Can a random number generator get away with less mixing than what is required for a block cipher? Yes. More on that later.
What's the cycle length of a pseudorandom number generator going to be?
If you've got an internal state of n bits, it can only have 2^{n} states, so you can produce at most 2^{n} results. It can't go longer than 2^{n} results without cycling.
What's the expected cycle length?
Suppose the mixing function is a truly random mapping. Let's construct it at random as we use its values. Start with the initial internal state. Choose some next internal state. What are the chances that it's a state we've seen before? The chances are 1/2^{n}, then 2/2^{n}, 3/2^{n}, ..., so the chance of not cycling after q steps is roughly 1 - sum(i=1..q)(i)/2^{n}. I know, I should multiply probabilities, not add, but (1-p)(1-q) is about (1-p-q) for very small p and q. The sum after q steps is q(q-1)/(2*2^{n}), so we'll get a cycle at about q = sqrt(2*2^{n}), that's 2^{(n+1)/2}. I've measured this in real generators, that's really how it works. A mixing step that is a random mapping gives you a cycle length of around 2^{(n+1)/2}. The graph looks like several cycles of length 2^{(n+1)/2} and smaller, with lots of branching hairs feeding into them.
Suppose the mixing function is reversible. That means it's a bijection: every value has exactly one previous value. The next value is guaranteed to be either new, or the first one. This is essentially pulling balls from a bin, looking for the one red one. You have to choose 1/2 the balls on average. I've tested this too, that's really how it works. So, cycle length 2^{n-1}, roughly the square of what you'd get if the mapping wasn't known to be reversible. The next biggest cycle will be half of what's remaining, and so forth. The graph looks like one 1/2 cycle, one 1/4, one 1/8, and so forth, with quite likely some one- or two-state cycles at the low end. Cycle length 2^{n-2} on average.
Reversible mappings don't guarantee a long cycle length. But they do make the expected cycle length long: 2^{n-2} on average, compared to 2^{n/2} for irreversible mappings. And they usually make small cycles so rare they're impossible to find.
Another advantage of reversible mappings is that the states on the cycles tend to be uniformly distributed. They're reversible, so they can't prefer to map everything to even numbers, because in reverse that would mean even numbers would have to map to two different things. They also cover half of all possible states on the same cycle, and every possible state is on some cycle. Irreversible mappings can, and just about always do, show biases like preferring to map to even numbers.
Between the short cycle length and the tendency to produce biased results, I avoid irreversible mappings.
There are a couple fast reversible primitives:
Right here, I wanted to go into perpetual motion machines, and how the second law of thermodynamics isn't an axiom, it's a consequence of the laws of physics being reversible. You see, there's a zillion disordered states for every ordered one, so no single reversible process will map all the disordered states to ordered states, they just won't fit. That's the same reason there's no fixed sequence of moves that will solve a Rubik's cube. Then I'd say how reversible systems are modelled by ergodic Markov chains. Ergodic means irreducible and aperiodic. In ergodic Markov chains, even slight mixing is compounded exponentially. Eventually the system reaches an equilibrium where all reachable states are equally likely. And that's exactly how the reversible mixing functions in pseudorandom number generators work.
But ... I wasn't able to connect the dots. Pseudorandom number generators aren't ergodic, not even reversible ones. They're periodic, and often not all states lie on the same cycle. Still, if most of their state is hidden, the remaining state resembles an ergodic system. For that matter, it's still being debated whether physics is truly nondeterministic, or if it's also deterministic but with some hidden state.
At any rate. The point is, once you've mixed the internal state of a pseudorandom number generator, you want it to stay mixed. You want the second law of thermodynamics to hold for your pseudorandom number generator. No spontaneous unmixing. No long-range order, at least at less than the cycle length. And, fortunately, the functions that don't spontaneously unmix (until they cycle) outnumber the ones that do spontaneously unmix by a zillion to one. That's because "random" really means "out of all possible choices". And "unmixed" means "something so rare it couldn't happen by chance".
Unfortunately, the mixing functions that spontaneously unmix are kind of attractive. They're simple. Or they're linear. Or they decompose the internal state into several independent internal states. The key is, they've got nice math constraining their long-range behavior. A good hint that there's too much nice math is that you can predict the cycle length exactly without actually traversing the whole cycle.
The most common nice math is linearity. All the basic arithmetic operations supported by computers are linear. Except maybe table lookup, and even that depends on the table values. However, they're linear over different fields, so you can combine operations to produce a nonlinear whole. For example, + is linear mod 2^{32} and XOR is linear in GF(2^{32}), but the combination of + and XOR isn't linear. You combine + and XOR, and linearity goes away.
If you've got no nice math, and you have a uniform distribution, then tests that look at many values at once will pass. For example, the Gorilla test, which samples 2^{26} 26-bit values and counts how many are missing, that'll pass. The spectral test, which forms points from short subsequences and sees how close together the points come in n-dimensional space, that'll pass too. Linear complexity tests will pass. (They test that there is no LFSR shorter than the sequence itself that produces the whole sequence. I'll get to LFSRs in a bit.)
If you've got no nice math, and you have a uniform distribution and a long cycle, then the only tests you need to worry about are ones that repeatedly sample short subsequences.
How big should the internal state be?
No application today can consume more than about 2^{64} random numbers. There's only been about 2^{90} cycles executed by all of humanity so far. So a cycle length of 2^{64} should be good enough.
Is an expected cycle length of 2^{64} good enough? No, you might get unlucky. Let's allow a 1 in 2^{32} chance of being unlucky.
Applications nowadays are parallelized. Really they'll have something producing unique seeds. These are farmed out to various machines, and each machine will use its seed to produce a stream of some length. Although individual streams aren't likely to cycle because the generator mix is reversible, the question of whether different streams will overlap is more like the irreversible case. Overlapping streams is just as bad as cycling. Worst case, each stream produces just 1 random number, so you need sqrt(states) streams before two streams overlap.
2^{64} streams should produce about 1 collision with an internal state of 128 bits (four 4-byte values). That many streams isn't feasible today, not even if each stream produces just one number. Slightly more realistic, if each stream produces 2^{32} numbers, and there are 2^{32} streams, a 96-bit state would produce about 1 collision: 2^{32}*(2^{32})^{2} = 2^{96}. A 128-bit state would have a 1 in 2^{32} chance of producing a collision.
So, a generator that is good enough for all noncryptographic purposes (today) needs at least a 128-bit internal state. Four 4-byte values. Even that is cutting it close.
That's assuming that all possible settings of the internal state are allowed. If there are restrictions on the values for the internal state, the cycles will be smaller, and you'll probably lose reversibility's guarantee of uniform distribution too since not all internal states are accessible. (Disallowing all zeros is OK, that's just one state out of 2^{128}, that's negligible.)
Splitting the internal state into independent pieces can also cause shorter cycles. Cycles will be between the max and the product of the cycle lengths of the pieces. Splitting the internal state also introduces some nice math, causing long-range correlations when the components cycle.
Many generators have an internal state consisting of n words. They have a counter that iterates over those words and reports them (maybe after some mixing), then replaces those words so they won't be repeated again. Since the values of those n words are independent, what is added to them doesn't matter much, so any single subsequence of n results will be truly uniform. If all internal states are equally likely, then all possibilities for those n results are equally likely (except for the effects of the other stuff added in).
Just as having no nice math guarantees no long-range correlations, iterating over the state guarantees no correlations up to distance n. The remaining window for correlations starts at n+1 and drops off as mixing increases. For example, when I was trying to develop ISAAC, which iterates through 257 values, the gap test would often show no biases for gaps up to length 257, then a spike at 258, then slowly decreasing biases after that. Another example: for the small-state generators with 4 values that I'm developing now, tests of five consecutive values are the most sensitive. Any test on subsequences of n or less will pass.
This grace period means the generator really can take n+1 results to achieve sufficient mixing. Adding words to the internal state allows weaker mixing as well as longer cycles.
Small states, on the other hand, allow quick initialization. They can be implemented on embedded processors with very small memory. They can often be held entirely in machine registers. They've got their place in the world.
Reducing the state size (by using smaller words, or fewer words) is a good way to test a pseudorandom number generator. It decreases the cycle length exponentially. It increases the percentage of states that are edge cases. It makes most biases detectable in much less time. However, for biases due to insufficient mixing, reducing the word size actually increases perbit mixing.
Suppose you put things into a bucket as you find them, and you know beforehand about how many things you expect to find. Then this holds on average:
(actual-expect)^{2}/expect = 1

Because

actual = expect ± sqrt(expect)   (by the Central Limit Theorem)
(actual-expect) = ±sqrt(expect)
(actual-expect)^{2} = expect
expect/expect = 1

Suppose every item seen has to go somewhere. If there are m buckets, and an item isn't placed in the first m-1 buckets, it has to go in the m-th. Then there are only m-1 degrees of freedom.

sum(over all buckets)((expect-actual)^{2}/expect) = m-1, not m.

The central limit theorem, again, says that 99.7% of the time,

sum(over all buckets)((expect-actual)^{2}/expect) = (m-1) ± 3sqrt(m-1)

It's usually reported normalized, (result-(m-1))/sqrt(m-1), so the normalized result is between -3 and 3, 99.7% of the time.
I left out lots of caveats, like a normal distribution isn't a good approximation for expected means less than 5, and the standard deviation isn't quite sqrt(m). And what if the samples aren't independent. Blah blah blah. So sometimes the normalized numbers go beyond 3 a little.
Suppose you've got a bias of sqrt(expect).
(expect-actual)^{2}/expect = (expect-(expect+sqrt(expect)))^{2}/expect = (sqrt(expect))^{2}/expect = expect/expect = 1

That's likely to get lost in the noise. Ah, but take twice the sample size:

(2expect-2actual)^{2}/2expect = (2expect-2(expect+sqrt(expect)))^{2}/2expect = (2sqrt(expect))^{2}/2expect = 4expect/2expect = 2

And double the sample size again:

= 4
Here's the gold standard. If you get a suspicious normalized chi-square measure (like 7), double the length of the sequence being tested. If the normalized chi-square measure doubles, you've found bias.
In summary, you do a bunch of math, then if you get a normalized chi-square measure between -3 and 3, the stream looks random. If you get a number bigger than 3, run the test on a stream twice as long and see if the result doubles. If it does, then things aren't random.
Just about all tests for pseudorandom numbers are chi-square tests. The most common ones are
Consider the run test.
And I run it. And EVERY generator fails after a billion 8bit values. Hum. Scale it down. 2bit values. Those fail just about immediately.
The formula says 1/144 of all runs will have exactly five strictly increasing values. What are the chances of a strictly increasing sequence of length 5, given that there are only four possible values? (Pause, maybe repeat the question, so the audience answers, or tries to think of the answer.)
Zeeeero. Oh. Ah. Published probabilities are wrong. Back to the drawing board. Moral: kick the tires. Until the test has passed for some generator, the test itself is most likely bad.
How long do you have to run tests? Central Limit Theorem again. If you run a test that counts how many times some pattern occurs, a count of n will usually have a random fluctuation of sqrt(n). A bias less than the random fluctuations will be lost in the noise. If you can guess a fraction p of the bits with probability 1/2 + 1/bias, you need a sequence of about bias^{2}/p bits. For example, p=1/1000, bias=1000: you need a sequence with 1 billion bits. To detect a bias half that size you have to run the tests four times longer.
Let's run a test on a billion values. This one's producing 32-bit values, counting how many bits are set in each value, concatenating those counts, then doing a chi-square calculation. That can detect a bias of at best 1/sqrt(billion) = 1/30,000, probably less if not everything produces a result. Oops, it finished. 18 seconds. Month-long tests are about my limit, that's sequences of length 2^{42}, 2^{44}, somewhere around there.
I'll fudge these a little, and not give optimal implementations, so I can show you their basic properties quickly. Every one of these generators uses reversible mixing of its internal state.
u4 i;
for (i=0; i<32; ++i) {
    state = (state>>1) ^ ((state&1) ? 0xedb88320 : 0);
}
return state;
Its internal state is one 32-bit word, "state". Its cycle length is 2^{32} - 1, that is, there's one cycle that contains every possible state except for 0. It's linear in GF(2^{32}). It reports its whole internal state directly. There are very fast tests for recognizing that its output comes from an LFSR: the polynomial (0xedb88320) can be deduced from just 2 words of output, and the next word or two give you faith that it is indeed an LFSR.
state[i%55] = state[i%55] + state[(i+24)%55]; r = state[i%55]; ++i; return r;
This generator is recommended in Knuth's The Art of Computer Programming. The low bits form an LFSR with cycle length 2^{55}-1. The higher bits do more complicated things. The high bits never affect the low bits, so the low bits have more bias than the high ones.
Summing 2 billion values takes 3 seconds. I used gcc -O3 on my home Core 2 Duo.
It reports its whole state directly. It iterates through its state, so any tests on subsequences of length 55 or shorter will succeed. There are biases beyond that limit, the most obvious being that val[i+55] = val[i] + val[i+24] 100% of the time.
The Mersenne Twister is basically a huge LFSR. Its internal state is 623 32-bit words, plus 1 bit, for a total of 19937 bits. Each step replaces one of these words with the XOR of itself and two others. Like any good LFSR, it has one cycle of length 2^{19937}-1 that goes through all possible internal states except for all-zeros. It also does some postprocessing per value:
y ^= (y >> 11); y ^= (y << 7) & 0x9d2c5680UL; y ^= (y << 15) & 0xefc60000UL; y ^= (y >> 18); return y; 
Summing 2 billion values takes 16 seconds. They've got a new version they say is 2x faster and better distributed, so perhaps that should be 8 seconds, but I haven't looked at the new version yet.
I wrote ISAAC, a cryptographic random number generator, back in 1995. ISAAC fills state (a 256-term array of 32-bit words) with the encryption of a seed, and also sets 32-bit words a, b, c. Each step does:
x = state[i]; a = (a ^ shift(a)) + state[(i+128) % 256]; state[i] = y = state[(x/4) % 256] + a + b; i = (i+1) % 256; return b = state[(y/1024) % 256] + x;
"shift(a)" alternates through a<<13, a>>6, a<<2, a>>16. The shift amounts were chosen to try to achieve avalanche in "a" quickly. ISAAC iterates through its state, so any test checking subsequences of length 256 or less will pass. "a" has achieved avalanche many times over by the time 256 results have been produced.
Although ISAAC does no postprocessing, it never reports its entire internal state. It seems secure: no way of deducing terms in state[] is known, and no biases have been found in it.
Summing 2 billion values takes 11 seconds.
RC4 initialization fills state[] (a 256-term array of bytes) with a permutation of 0..255, and sets i=j=0. Then each step does
i = (i + 1) % 256; j = (j + state[i]) % 256; swap(state[i], state[j]); return state[(state[i] + state[j]) % 256]; 
RC4 produces a byte at a time instead of a 32bit word at a time. It does not report its internal state directly, in fact, it makes an OK stream cipher if it is seeded well. Like everything else, RC4 uses i to iterate through its state, report a value based on state[i], and replace state[i].
RC4 does not allow all possible internal states (the internal state has to be a permutation of 0..255), and there is some nice math defining equivalent states. It pays for this by having much shorter cycles than you'd expect for a generator with 258 bytes of internal state (but still much longer than you could ever practically need). In fact 1/65536 of all states are on 255-term cycles that can't be reached due to the initialization, and their absence from the other cycles leads to biases that can be detected in subsequences of length 2 in about a billion values.
RC4 produces results faster than ISAAC, but since ISAAC produces 32-bit results while RC4 produces 8-bit results, ISAAC produces pseudorandom bytes 3x faster than RC4.
FLEA was originally an experimental cryptographic pseudorandom number generator of mine based on ISAAC. (FLEA: Fast Little Encryption Algorithm.) The state was 259 32-bit terms: state[256] plus b, c, d. All initial states except all zeros were acceptable.
for (i=0; i<256; ++i) { e = state[d % 256]; state[d % 256] = b; b = (c<<9) + (c>>23) + d; c = d ^ state[i]; d = e + b; return c; } 
Although FLEA has fewer operations than ISAAC, it required more registers, which exceeded Intel's register quota and caused FLEA to be the same speed as ISAAC on Intel boxes. Although FLEA looks possibly secure, it is inherently less secure than ISAAC. Since it had no advantages over ISAAC, I gave up on it as a stream cipher.
However, if I reduced the array size in FLEA from 256 to 1, it still passed every test I could find for sequences up to 2^{41} values. That makes the state four 32-bit terms: a, b, c, d. "a" replaces "state[]". I arbitrarily changed the shift amounts in the rotate, and changed an addition to XOR. No tests that ran in less than a month could tell me if one shift constant was better than another. Each pass of the new version did:
e = a; a = b; b = (c<<19) + (c>>13) + d; c = d ^ a; d = e + b; return c; 
It's now iterating through its four terms. It modifies one term and reports another. It never reports the entire current state directly. Still, it shouldn't be too hard to work through the formulas and deduce all the values.
Summing 2 billion values takes 4 seconds.
That's the background. Now we're caught up to the state of things last December.
I posted the 4-term FLEA on sci.crypt.randomnumbers, saying that it passed every test I knew of for 2^{41} values. Geronimo Jones (aka John Blackman) (I think he said he's in Australia), OK, this guy out there, immediately replied that he had a test, NDA, that detected bias in 2^{30} values. That's a 2048x shorter sequence than my tests needed, quite an improvement. I ran his code and confirmed that.
NDA is a modified gap test. It watched 4-bit chunks, and counted gaps of length up to 48 (that's 4*48=192 bits). This will detect bias in generators with an internal state less than 192 bits, e.g. five 32-bit terms or less. Instead of watching for one value and recording how long it was since that value last appeared, it watches every value and records how long it has been since every value last appeared.
Since there are 16 possible 4-bit values, it has to update 16 counts for every chunk, not just 1 like the gap test would. And since there are eight 4-bit chunks per 32-bit result, this test does a lot more work than most tests do for a given sequence. Even so it detected bias in my generator in half an hour, while my old tests ran for days without detecting anything. It has 16*16*49 buckets (48, plus 1 for longer gaps), 12544 total, and he hand-tuned the expected standard deviations for that.
What NDA was detecting was bias between val[i] and val[i+3], shifted by 19 bits to the left. Or was that val[i] and val[i+4], shifted 13 bits to the right? They're the same thing to NDA.
Is it val[i+3] or val[i+4]? Sometimes you can tell by working through the algebra. Four consecutive values are r(b-(c^a)), c, d^b, (a+(r(c)+d))^(r(c)+d). That's reporting the entropy in b, c, d, a respectively. Though that last one is of the form (a+xxx)^(xxx), which isn't a permutation of a, so maybe it could be val[i+3] correlating to val[i].
The gap test looks for repeats of the same value, and the rotate by 19 changed what the value was, so the gap test saw nothing. NDA looked at all values, so it was able to see the correlation, no matter what the old and new values were.
Now that I knew what to look for, I extracted bits 0, 1, 19, 20 from each result and ran the run test on those 4-bit values. That was able to detect bias in a sequence of 2^{24} values. But that's close to cheating for noncryptographic tests, I had to know which bits to look at.
The run test gave these chi-square measures for run lengths 1..7, with 8 being a bucket for everything else, for a sequence of length 2^{30}:

run test: expect 7, get 976.5535, normalized chi-square 366.4568
chi-square contributions for run lengths 1..8:
  27.2979  104.7676  36.1316  77.2765  530.5687  101.2977  98.8098  0.4036

Runs of length 5 were the worst (val[i]..val[i+4] were ascending but val[i+5] wasn't). However, even runs of length 1 (val[i+1] < val[i]) were wrong enough to cause the test to fail. Starting with some internal state, val[i]=c and val[i+1]=d^b. The only way runs of length 1 could be wrong is if the long-range set of internal states is biased. Isn't this generator reversible? Yes, and the reverse fails the run test too. Well that's bizarre. How could this happen? Every state is on some cycle. Maybe there are some special states that all live on their own cycle, and I'm seeing the absence of those states?
Combining a value with itself is the usual source of bias. For example, mod 2^{n}, i + permute(i) is never a permutation. (Proof: the sum of all i mod 2^{n} is 2^{n-1}, but the sum of all i + permute(i) mod 2^{n} is 0.)
I assumed the run test was complaining because bits 0 and 19 are being combined with themselves before they were mixed well with other bits. The cure for that is to rearrange the shifts or introduce new ones, so that those bits are not combined with themselves. So I did that, found a new generator, confirmed it passed NDA and the run test for 2^{41} values, and posted that:
e = a; a = ((b<<15) | (b>>17)); b = c + ((d<<27) | (d>>5)); c = d + a; d = e + b; return c;
Immediately after posting the new generator to sci.crypt.randomnumbers, Orz (aka Bob Blorp2) reported that he had a bit-counting test that detected bias in it in sequences of 2^{36} values. His test was much faster than NDA for a given sequence length.
His bit-counting test ran some good generators for awhile to guess what the distributions would be, then ran the generator being tested and compared it to the estimated correct distribution. He would count the bits in a 32-bit value, divide by 2 to reduce the number of cases, do the same for 5 consecutive words, and concatenate those counts. He did that for all 5-term subsequences, including overlapping ones.
I looked at his code, scratched my head, and wrote my own version that calculated probabilities exactly and used a table to map bitcounts to buckets. The most sensitive table seemed to be one with three buckets per 32-bit result: low (< 15 bits set), medium (15 to 17 bits set), or high (> 17 bits set). This could detect bias in the original 4-term FLEA in 2^{24} values, just like the tuned run test but without any tuning, and it could detect bias in my new generator in 2^{36} values.
When I scheduled this talk, I thought this test was new. Nope. The same thing is in Marsaglia's DIEHARD, Count-The-1's, except it was counting bits in bytes, for 5 consecutive bytes. Also Marsaglia tries to account for overlapping subsequences being correlated. My implementation ignores that, so mine has an acceptable range of about -5..5 instead of -3..3. Marsaglia's version could catch problems with a state of one 32-bit value, but not a state of four 32-bit values. I never saw Marsaglia's fail, for that reason, and also because DIEHARD only runs it for a ridiculously short 2^{18} bytes.
As near as I can tell, the bit-counting test is detecting failure to achieve sufficient overall avalanche. It is noticing that bitcounts between nearby values don't change unpredictably enough. Unlike NDA or the gap or run test, it's nearly immune to what shift constants I choose. It doesn't care where the bits get set, so long as they get set somewhere.
A single subtraction can give diffs that look like 0x000fffff, which could make bit-counting happy even though very little mixing really happened. To get around this, I sometimes gray-coded the result before counting its bits: x^(x<<1) instead of x. That changes 0x000fffff into 0x00100001, just 2 bits set. Sometimes this helped detect bias earlier, sometimes it didn't.
I tried all variants of two shifts and three additions (or xors or minuses) that I could think of, and the bit-counting test shot down all of them in 2^{36} values or less.
Meanwhile, I tried testing initializers for FLEA. The initializer sets the internal state, then FLEA is called to produce some values. How many values need to be produced? That depends on how quickly FLEA achieves avalanche. The faster it achieves avalanche, the fewer iterations initialization is allowed to do.
I did this by using the initializer as a pseudorandom number generator. I'd set the seed to some constants, except one term was set to a counter. Then I'd run the initializer. Then I'd do one pass of the generator and report a single pseudorandom value. Then I'd increment the counter and do it again, a million times. It was slow, but I already had the random number tests set up to test it.
John Blackman (aka Geronimo Jones) suggested that instead of incrementing a counter, I use the previous seed, but flip a random bit in it. I never did try that.
Then Orz's bit-counting test indicated that the real problem with these generators was they weren't achieving enough avalanche between when the generator iterated over a term and when it next iterated over that term again (that's val[i] vs val[i+4] for my generator).
So I wrote a test that would measure avalanche between a result and the result 5 from now. It's really a test of the mixing function, not of the generated stream.
It looks like there isn't much of a difference between random number generators and hash functions or block ciphers, after all.
For every bit in the internal state, it starts with a random internal state. Then it makes a second internal state that is the same except it differs in just that bit. Then it produces four values for each state, val[i+1] ... val[i+4]. Then I diffed the two val[i+4], and counted how many bits were set in the diff. I did this a thousand times, tracking the average number of bits set and unset in the diff. I did a diff by XOR and by subtraction, and gray-coded the subtraction diff. I did that for each bit of the internal state. And reported the min average count.
I did some other optimizations, like having a cutoff and quitting when I find the first bit that causes too little avalanche. And I can check the average for 64 pairs, then 128, and so forth, and spend the time on 16384 pairs for a very accurate average only if I already know the average is looking reasonable.
The ideal is 16 bits flipped on average (half of the 32 total bits). Due to my only taking so many samples, and there being lots of bits to check, random fluctuations will make the expected number a little smaller for a truly random mapping, somewhere around 15.5.
None of the generators I was testing scored better than 4 bits.
This is really fast: I can measure avalanche for a single generator in about a millisecond. Compare that to weeks or months for chi-square tests. I think the avalanche test and the chi-square tests are testing the same thing. The difference is that the avalanche test gets to specify what the internal state looks like, while the chi-square tests don't. The chi-square tests have to wait for fortuitous internal states to come along.
How could an avalanche of 4 bits be good enough? Recall that a chi-square test can detect an event with probability p that causes a given bias only after about 1/(p*bias*bias) results. The probability p that internal states are very similar is very low. You'll never find internal states that are the same except in one term. In fact, for any internal state, only 1/103623 of all other internal states have a value of term "c" that differs from the original state's "c" by 4 bits or less. This unavoidable dissimilarity of internal states makes pseudorandom number generators rather undemanding of their mixing functions.
I rigged the avalanche test, rngav.c, to try all possible shift constants and report just the best. In a few hours I explored many times more configurations than I had in the past year. Most of the time was spent timing configurations rather than testing them.
After I found the avalanche test's recommendations, I still had to spend a month running chi-square tests on them, but the recommendations all passed. I haven't named these. They're just arbitrary configurations recommended by the avalanche test.
There were configurations slightly slower than FLEA that had an avalanche of 8.8 for several shift amounts. That's better than 4, but still well short of full avalanche. Here's one of them:
typedef unsigned long int u4;
typedef struct ranctx { u4 a; u4 b; u4 c; u4 d; } ranctx;

#define rot(x,k) (((x)<<(k))|((x)>>(32-(k))))

static u4 ranval( ranctx *x ) {
    u4 e = x->a - rot(x->b, 27);
    x->a = x->b ^ rot(x->c, 17);
    x->b = x->c + x->d;
    x->c = x->d + e;
    x->d = e + x->a;
    return x->d;
}

static void raninit( ranctx *x, u4 seed ) {
    u4 i;
    x->a = 0xf1ea5eed;
    x->b = x->c = x->d = seed;
    for (i=0; i<20; ++i) {
        (void)ranval(x);
    }
}
This passed the bitcount test for 2^42 values. It also passed my frequency, gap, and run tests for 2^40 values, and NDA for 2^38 values. That's as long as I've run any of those tests on this generator. Apparently an avalanche of 8.8 is enough.
This was over twice as fast with Microsoft's Visual Studio compiler as with gcc -O3. Intel allows only two-operand instructions, a=a+b, but with the caveat that it can do three-operand addition when calculating addresses. Microsoft's compiler pretends that plain additions, like c=d+e, are doing address calculations.
Summing 2 billion values takes 6 seconds (with gcc -O3).
Just in case the 8.8-bit generator isn't overkill, here's another generator that has an average avalanche of at least 13 bits. I had the avalanche test go through all 32K possible shift amounts and it liked these best; that took about a minute. This generator also passes all the chi-square tests.
u4 ranval( ranctx *x ) {
    u4 e = x->a - rot(x->b, 23);
    x->a = x->b ^ rot(x->c, 16);
    x->b = x->c + rot(x->d, 11);
    x->c = x->d + e;
    x->d = e + x->a;
    return x->d;
}
Summing 2 billion values takes 11 seconds.
Pseudorandom number generators should be reversible and nonlinear, they should have at least 128 bits of internal state, and they should mix that state quickly. If they produce bytes slower than RC4, they're too slow.
Tests of pseudorandom streams compute chi-square measures, which produce results between -3 and 3 if all is well. Usually. If all is not well, doubling the input size will double the chi-square result. Chi-square testing is incredibly slow. That's because of the central limit theorem: to find a bias of 1/n, you have to look at n^2 values, because random fluctuations are the square root of the total. The most sensitive chi-square test I know of is to count the bits in a result, then look at the distribution of several consecutive counts. You won't find anything unless you look at subsequences bigger than the internal state.
Testing for avalanche, unlike chi-square testing, is really fast. And it seems to test the right thing. If you ever have to design a noncryptographic pseudorandom number generator, use the avalanche test to explore the space of possibilities, then use the expensive chi-square tests just as sanity checks once you've settled on a solution. Pseudorandom number generators demand much less avalanche than hash functions or block ciphers.