Reputation: 10054
I need to be able to test a text list for file names that seem random;
e.g. aggvvcx.com or kbzaandc.exe
Is there any sensible/reasonable way to do this? The only idea I have is to check the ratio of vowels to consonants, but this doesn't seem reliable, and neither does using a dictionary.
The only information I have about the nature of the randomness is that it is a filename. Maybe it is possible to get a dictionary of common file names and use some kind of pattern parser to determine common file naming patterns and run this against the list after training? This would obviously be a futile approach if we're considering multiple languages, but I'm only interested in checking for English filenames.
Upvotes: 0
Views: 496
Reputation: 2399
I had to solve a closely related problem for a source code mining project and developed Nostril (for "Nonsense String Evaluator"). This Python 3 package is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.
Example: the following code,
from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
The project is on GitHub and I welcome contributions.
Upvotes: 0
Reputation: 3314
There are lots of randomness tests, so the first issue for you is going to be deciding what you mean by randomness. I'm afraid that's a non-trivial decision to make. But the Wikipedia page is a good place to start.
https://en.wikipedia.org/wiki/Randomness_tests
The good news is that if you just need it to be "pretty jumbled" then there are a number of reasonable (i.e., computationally cheap and generally good-enough) approaches you can take.
Upvotes: 0
Reputation:
You could try:
- for longer strings, gzip compression with zlib, where greater compression indicates lower randomness
- frequency analysis of characters in the string, compared against averages for the appropriate natural languages
- a Google search, assuming random strings are likely to have significantly fewer hits
- soundex, to determine whether the string has at least one syllable and is therefore more likely to be pronounceable, and so less likely to be random
- n-grams with naive Bayesian analysis (http://theory.stanford.edu/~dfreeman/papers/namespam.pdf)
- training a neural network to do it, similarly to spam filtering
- a combination of all of the above for best results, following the approach of the winner of the Netflix challenge: a combination of relatively mediocre tests may produce a much better one.
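The first suggestion can be sketched in a few lines. Note that zlib's header overhead dominates for strings as short as typical filenames, so this discriminates best on longer strings, as noted above:

```python
import zlib

def compression_ratio(s: str) -> float:
    """Compressed size divided by original size; lower means more structure."""
    data = s.encode('utf-8')
    return len(zlib.compress(data)) / len(data)

# Repetitive strings compress well; jumbled ones barely compress at all
# (for very short strings the fixed overhead dominates either way).
print(compression_ratio('abcabcabcabcabcabcabcabcabcabc'))
print(compression_ratio('xkqjvwzpfhgydmbtrncsleoaiuqwzx'))
```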
Upvotes: 1
Reputation: 727
One semi-rough, quick heuristic check would be to sort a string by individual letters and compare the sorted sequence against the likelihood of generating that sequence randomly for that length. E.g., for word length 2, a (sorted) string "AA", given 26 letters in the alphabet, has a chance of 1/(26*26), but a (sorted) string "AB" - which is generated by both "AB" and "BA" - has a chance of 2/(26*26).
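A minimal sketch of that calculation, counting the arrangements of the string's letters with a multinomial coefficient:

```python
from collections import Counter
from math import factorial

def sorted_sequence_chance(s: str) -> float:
    """Probability that a uniform random a-z string of this length sorts to
    sorted(s): (distinct arrangements of s's letters) / 26**len(s)."""
    s = s.lower()
    arrangements = factorial(len(s))
    for count in Counter(s).values():
        arrangements //= factorial(count)
    return arrangements / 26 ** len(s)

print(sorted_sequence_chance('AA'))  # 1/676
print(sorted_sequence_chance('AB'))  # 2/676
```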
P.S. From a programming perspective, another way would be to run a spell checker against the string and count how many "mistakes" there are, then apply a threshold value.
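A rough sketch of the spell-checker idea. The tiny vocabulary and the difflib-based fuzzy matching here are only stand-ins for a real dictionary and spell-checking library:

```python
import difflib

# Illustrative sample vocabulary; a real check would load a full word list
# (e.g. /usr/share/dict/words) or use a proper spell-checking library.
WORDS = ['setup', 'update', 'install', 'report', 'backup', 'config',
         'service', 'manager', 'viewer', 'launcher']

def looks_misspelled(name: str, cutoff: float = 0.6) -> bool:
    """True if no vocabulary word is a close match for the extension-free name."""
    stem = name.rsplit('.', 1)[0].lower()
    return not difflib.get_close_matches(stem, WORDS, n=1, cutoff=cutoff)

print(looks_misspelled('instal.exe'))    # close to 'install' -> False
print(looks_misspelled('kbzaandc.exe'))  # no close match -> True
```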
Upvotes: 0
Reputation: 34145
What do you mean by random exactly? There are many ways to answer that question.
Technically, it could be "how much entropy they contain" using information theory methods.
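As a sketch, per-character Shannon entropy is cheap to compute, though it is a noisy signal for strings as short as filenames:

```python
from collections import Counter
from math import log2

def char_entropy(s: str) -> float:
    """Shannon entropy of the string's character distribution, in bits/char."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(char_entropy('aaaaaaaa'))  # one repeated symbol: 0 bits
print(char_entropy('abcdefgh'))  # eight equally likely symbols: 3 bits
```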
Since you mention dictionary, you may actually mean "do they look like real words?" This can be checked for long texts using letter distribution, but will fail for short names like the ones you show. Instead you could try n-grams for characters. This is similar to letter frequency, but for 2/3-letter sequences. That means if you try bigrams, you'll find that the first word contains "gv", "vv", "vc", "cx", which are likely impossible to find in any English word.
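A minimal sketch of that bigram check. The sample vocabulary is purely illustrative; a real detector would build its bigram set from a large dictionary or corpus:

```python
# Flag character pairs that never occur in a reference vocabulary.
SAMPLE_WORDS = ['aggregate', 'vaccine', 'executable', 'keyboard', 'document',
                'picture', 'service', 'windows', 'system', 'update']

KNOWN_BIGRAMS = {w[i:i + 2] for w in SAMPLE_WORDS for i in range(len(w) - 1)}

def unseen_bigrams(name: str) -> list:
    """Bigrams in the extension-free name that the vocabulary never produces."""
    stem = name.rsplit('.', 1)[0].lower()
    return [stem[i:i + 2] for i in range(len(stem) - 1)
            if stem[i:i + 2] not in KNOWN_BIGRAMS]

print(unseen_bigrams('aggvvcx.com'))  # → ['gv', 'vv', 'vc', 'cx']
```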
There are also other ways to answer the question, so you'll have to figure out exactly what "random" means to you in this situation.
Upvotes: 1