Chibueze Opata

Reputation: 10054

How to determine if a filename is random?

I need to be able to test a list of filenames for names that seem random,

e.g. aggvvcx.com or kbzaandc.exe

Is there any sensible/reasonable way to do this? The only idea I have is to check the ratio of vowels to consonants, but that doesn't seem reliable, and neither does using a dictionary.
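For concreteness, the vowel-to-consonant idea is at least cheap to sketch (`vowel_ratio` is an illustrative helper, not an existing library function):

```python
import re

def vowel_ratio(name):
    """Ratio of vowels to all letters in the filename's stem."""
    stem = name.rsplit('.', 1)[0].lower()
    letters = re.findall(r'[a-z]', stem)
    if not letters:
        return 0.0
    return sum(ch in 'aeiou' for ch in letters) / len(letters)

print(vowel_ratio('aggvvcx.com'))  # 1/7, suspiciously low
print(vowel_ratio('report.exe'))   # 1/3, typical for English
```

As suspected, a ratio threshold alone misclassifies plenty of legitimate short names ("xml", "src"), which is part of why it feels unreliable.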

EDIT: Definition of randomness

The only information I have about the nature of the randomness is that it is a filename. Maybe it is possible to get a dictionary of common file names and use some kind of pattern parser to determine common file naming patterns and run this against the list after training? This would obviously be a futile approach if we're considering multiple languages, but I'm only interested in checking for English filenames.

Upvotes: 0

Views: 496

Answers (5)

mhucka

Reputation: 2399

I had to solve a closely related problem for a source code mining project and developed Nostril (for "Nonsense String Evaluator"). This Python 3 package is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.

Example: the following code,

from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

will produce the following output:

bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense

The project is on GitHub and I welcome contributions.

Upvotes: 0

Burleigh Bear

Reputation: 3314

There are lots of randomness tests, so your first issue is going to be deciding what you mean by randomness. I'm afraid making that decision is non-trivial, but the Wikipedia page is a good place to start:

https://en.wikipedia.org/wiki/Randomness_tests

The good news is that if you just need it to be "pretty jumbled" then there are a number of reasonable (i.e., computationally cheap and generally good-enough) approaches you can take.
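One "pretty jumbled" check in that cheap, good-enough spirit is looking at the longest run of consonants (a sketch; the cutoff and the check itself are illustrative choices, not a standard randomness test):

```python
import re

def longest_consonant_run(name):
    """Length of the longest run of consecutive consonants in the stem.
    English words rarely go far past 4 ('strengths' reaches 5)."""
    stem = name.rsplit('.', 1)[0].lower()
    runs = re.findall(r'[bcdfghjklmnpqrstvwxz]+', stem)
    return max((len(r) for r in runs), default=0)

print(longest_consonant_run('aggvvcx.com'))    # 6, well past English norms
print(longest_consonant_run('strengths.txt'))  # 5
```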

Upvotes: 0

user4322779

Reputation:

You could try

  1. https://github.com/rrenaud/Gibberish-Detector

  2. for longer strings, compression with zlib, where greater compression indicates less randomness

  3. frequency analysis of characters in the string compared to averages for the appropriate natural language

  4. a Google search, assuming random strings are likely to have significantly fewer hits

  5. soundex, to determine whether the string has at least one syllable and is therefore more likely to be pronounceable and so less likely to be random

  6. n-grams with naive Bayesian analysis (http://theory.stanford.edu/~dfreeman/papers/namespam.pdf)

  7. training a neural network to do it, similarly to spam filtering

  8. a combination of all of the above for best results, following the approach of the Netflix Prize winner: a combination of relatively mediocre tests can produce a much better one
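The compression idea (item 2) is a one-liner with the standard library; a sketch, with the caveat that it only becomes informative for longer strings:

```python
import zlib

def compression_ratio(s):
    """Compressed size divided by raw size; redundant (word-like)
    text compresses better, so lower suggests less randomness."""
    raw = s.encode('utf-8')
    return len(zlib.compress(raw, 9)) / len(raw)

# Only meaningful for longer strings: zlib's fixed header and
# checksum overhead swamps the signal on short filenames.
repetitive = 'thequickbrownfoxjumpsoverthelazydog' * 4
print(compression_ratio(repetitive))  # well below 1
print(compression_ratio('kbzaandc'))  # above 1: overhead dominates
```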

Upvotes: 1

Nicko Po

Reputation: 727

One rough and quick heuristic would be to sort the string's letters and compare the sorted sequence against the likelihood of generating that sequence randomly at that length. E.g. for length 2, the (sorted) string "aa", given 26 letters in the alphabet, has a chance of 1/(26*26), but the (sorted) string "ab", which is produced by both "ab" and "ba", has a chance of 2/(26*26).
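That probability is just the multinomial permutation count over 26^length; a sketch reproducing the 1/(26*26) and 2/(26*26) figures (turning it into a flag still needs a threshold of your choosing):

```python
from math import factorial
from collections import Counter

def sorted_seq_probability(word):
    """Probability that a uniform random string of the same length
    sorts to the same letter multiset: permutations / 26**length."""
    perms = factorial(len(word))
    for count in Counter(word.lower()).values():
        perms //= factorial(count)  # divide out repeated letters
    return perms / 26 ** len(word)

print(sorted_seq_probability('aa'))  # 1/(26*26)
print(sorted_seq_probability('ab'))  # 2/(26*26)
```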

P.S. From a programming perspective, another way would be to run a spell checker against the name, count how many "mistakes" there are, and set a threshold on that.

Upvotes: 0

viraptor

Reputation: 34145

What do you mean by random exactly? There are many ways to answer that question.

Technically, it could mean "how much entropy do they contain", using methods from information theory.
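A sketch of that entropy angle, assuming plain Shannon entropy over the character counts:

```python
import math
from collections import Counter

def char_entropy(s):
    """Shannon entropy (bits per character) of s's letter distribution."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

# Caveat: for an 8-character string the maximum is log2(8) = 3 bits,
# and ordinary English words sit close to it too, so entropy alone
# won't separate short real names from gibberish.
print(char_entropy('aaaaaaaa'))  # 0.0, no surprise at all
print(char_entropy('kbzaandc'))  # 2.75, near the maximum
```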

Since you mention a dictionary, you may actually mean "do they look like real words?" This can be checked for long texts using letter distribution, but it will fail for short names like the ones you show. Instead, you could try n-grams over characters. This is similar to letter frequency, but for 2- or 3-letter sequences. With bigrams, you'll find that the first name contains "gv", "vv", "vc", and "cx", which are unlikely to appear in any English word.
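A toy version of that bigram check (a real detector would score bigram frequencies learned from an English corpus; the rare set here is hand-picked for illustration):

```python
# Hand-picked bigrams that are rare-to-impossible inside English
# words; illustrative only, not derived from a corpus.
RARE_BIGRAMS = frozenset(['gv', 'vv', 'vc', 'cx', 'zx', 'xw',
                          'qk', 'wq', 'jx', 'bz', 'kb'])

def unusual_bigrams(name):
    """Return the rare bigrams found in the filename's stem."""
    s = name.rsplit('.', 1)[0].lower()
    return [s[i:i + 2] for i in range(len(s) - 1) if s[i:i + 2] in RARE_BIGRAMS]

print(unusual_bigrams('aggvvcx.com'))   # ['gv', 'vv', 'vc', 'cx']
print(unusual_bigrams('document.com'))  # []
```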

There are also other ways to answer the question, so you'll have to figure out exactly what "random" means to you in this situation.

Upvotes: 1
