Reputation: 642
Here randomness mean pseudo randomness like the random number generator of Linux. For example, I have 100-1000 arrays that each contain 10000 random integers generated by Linux pseudo random number generator. And now given an new integers sequence, if any machine learning algorithm like classification or clustering can detect if this integers sequence is a pseudo random number sequence like the previous training data?
For some reason, I don't care the real randomness of given sequence actually, I just want to know if this given sequence is generated by some sepcial linux pseudo random integer generator. Suppose Linux RNG does have one induction function to generate a pseudo random integers sequence, can we predict if an existing integer sequence is generated by this RNG based on some existing random sequences generated by it?
Upvotes: 2
Views: 1323
Reputation: 17945
Let us clarify what is meant by "linux PRNG", by assuming you actually mean "/dev/urandom", as recommended in this answer.
Now, the algorithm behind it is well known - let us read about it in the source comments:
When random bytes are desired, they are obtained by taking the SHA hash of the contents of the "entropy pool". The SHA hash avoids exposing the internal state of the entropy pool. It is believed to be computationally infeasible to derive any useful information about the input of SHA from its output. Even if it is possible to analyze SHA in some clever way, as long as the amount of data returned from the generator is less than the inherent entropy in the pool, the output data is totally unpredictable. For this reason, the routine decreases its internal estimate of how many bits of "true randomness" are contained in the entropy pool as it outputs random numbers.
If this estimate goes to zero, the routine can still generate random numbers; however, an attacker may (at least in theory) be able to infer the future output of the generator from prior outputs. This requires successful cryptanalysis of SHA, which is not believed to be feasible, but there is a remote possibility. Nonetheless, these numbers should be useful for the vast majority of purposes.
So there is a pool of randomness obtained from different sources, and periodically renewed; a typical size seems to be 4096 bits. Replace "attacker" with "machine learning", and you have your answer:
if the pool is empty (that is, a very long sequence of random bytes was requested without the pool being replenished), then the problem is equivalent to identifying the output of SHA-1.
if the pool is being replenished between draws of numbers, then the problem is even harder, because you will have shorter runs on which to test any hypothesis on how numbers were being generated.
There is no known way to reverse SHA-1, or to distinguish SHA-1 outputs from non-SHA-1 outputs. Standard classification and clustering will certainly fail at this task, and it would be very surprising to find an algorithm that succeeded, because it would present a workable attack against SHA-1 itself. Not that SHA-1 is perfect (it is already being deprecated, because collision attacks have already been described, and they will only get better) -- but it has withstood time pretty well.
Upvotes: 2
Reputation: 625
This is a traditional question people have asked. For a classic analysis, please consult The Art of Computer Programming by Donald Knuth, Volume 2 -- "Seminumerical Algorithms".
If you require more state-of-the-art test, there is a software suite available at http://csrc.nist.gov/groups/ST/toolkit/rng/index.html, which also contains ample documentation.
You may also consider reading the Wikipedia page: http://en.wikipedia.org/wiki/Statistical_randomness to get an introductory idea of what the field is about.
Upvotes: 1