Reputation: 2399
A Python application I'm writing needs to extract identifiers and text strings out of source code. A small percentage of what it finds are (seemingly) random strings. I'd like to filter them out, but so far have not been able to create a regexp to do it. It is not possible to filter by length only, because there are some very long identifiers that are valid. Here is an example taken at random, compared to a valid identifier of the same length:
UGxhemEgZGUgaWZXNaWdhZGyOiBDSUWRVNUQVYtSVBOIFVuaWQ
NSApplicationDidChangeScreenParametersNotification
Is there a way to write a regexp or other detection system that would detect junk sequences like this? I am beginning to suspect it can't be done without testing strings against a large dictionary of words, which I believe would be prone to errors as well as be compute-intensive. However, maybe someone more clever knows of an approach to detecting or matching random sequences like this?
An ideal solution to this problem would be a function that takes a string as input and reports whether it is "probably" random. It may produce false negatives (misreport some random strings as not random), preferably with low probability, but it must not produce false positives (report something as random when it is not). In case it matters, the strings seem to range in length from 25 to 80 characters.
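In rough signature form, what I'm after is something like this (probably_random is just a placeholder name):
def probably_random(identifier: str) -> bool:
    """Return True only when `identifier` is very likely a random string.

    False negatives are tolerable, preferably rare; false positives
    are not.  Inputs run roughly 25 to 80 characters.
    """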
EDIT #1 2017-02-08: thinking further, it occurred to me that a possible approach might be a regexp that requires a minimum number of unique characters in a row. For example, the 2nd character would have to differ from the first, the 3rd from the previous two, the 4th from the previous three, and so on. A long enough version of this might catch a lot of random sequences. However, looking at the available regexp operators, I don't see (for lack of better words) a "negative back reference" or a "match something other than the thing you just matched". The closest I can construct is a backreference inside a negative lookahead; a sketch follows below. If someone knows a better variation on this, maybe I can make it work.
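Here is that sketch (the window size of 4 and the run length of 20 are arbitrary placeholders, and I have not validated this as a discriminator):
import re
# Each step consumes one character and asserts that it does not reappear
# within the next three positions, so every 4-character window of a
# matched run contains four distinct characters.
no_repeat_window = re.compile(r'(?:(.)(?!.{0,2}\1)){20,}')
def probably_random(s):
    return no_repeat_window.search(s) is not None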
EDIT #2 2017-02-10: I'm worried that the way I wrote my two example strings above may be misinterpreted as a single string. The examples above are two separate strings of the same length; my sincere apologies if that was unclear. Here are some more examples, each line being a separate identifier. This also shows different lengths on purpose.
shouldBeAbleToCountLiveNeighboursOfACellOnDiagonalsAndStraightLines
eXNZWzIGbHRpbWVkaWEgYWkIGFuaWhdGlvbiBkaXNcmlidXRlZCNCpUgRGlzdHJpYnV
dWxLXRvbGVyYWIHJlYWwtdGltZSBzeXNZWzLgKlSBEaXNcmlidXRlZCBBcmNoaXRlYR
dGhIExvIHNYmltbMgYSBsYSBwWdpbmEgeSBsbyBhbnVuYlhbWzIGVuIGVsIHByhpbWg
aGUgYuZmVyZWjZSBwcmjZWVkaWncygDQoNClNYmpcNpbNCkluIGyZGVyIHRvIHN
YQKUGFyYTogZXNYFyQGluYWlcCteAKQMIExaXMgQSgUGluZWRhDQpDQzogQuYVw
thehasSizeMatcherShouldMatchACollectionWithExpectedSize
QycmVvIGRlIERpcVtaWhYnDsgZGUgYWNaXZpZGFkZXMgZGUgbGEg
NSAppleEventManagerWillProcessFirstEventNotification
SNMTransformGizmoRotationControllerPerformTransform
RndkOiBEaWZcnDsgZGUgYudmjYXRvcmlhIFNVTUJVCBlbiBSRUJ
For whatever it's worth, I put on pastebin a list of the 1000 longest identifiers pulled by my application from a semi-random selection of about 900 GitHub repositories. It contains both real identifiers and random strings.
Upvotes: 2
Views: 1487
Reputation: 688
First of all, thank you; your question interested me, plus I was looking for some fun training. Below I have implemented the idea I mentioned in the comments to your post, as well as the idea of @swbandit. Any other tactic can be added to the code by modifying the is_rnd function.
I generated meaningful strings from the short dictionary found here (https://gist.github.com/jbergantine/2390284) (of course this dictionary is small and may not be representative, but I used it for testing purposes). These strings are referred to as strok in the code. After that, a random string (strrnd) of the same length was generated. I used only lowercase characters and assumed there are no spaces in the strings.
Functions is_rnd1 and is_rnd2 return True if the string is random. Function is_rnd1 checks the frequency of the most frequent English character, 'e' (12.7%), and the rarest, 'z' (0.074%); however, the frequency bounds in the function are significantly relaxed. Function is_rnd2 just checks for the appearance of four consecutive consonants, as suggested by @swbandit.
In the test part of the code, the functions described above are tested on strings of different lengths, measured in terms of the number of words constituting strok. Function is_rnd is called twice: first with strok and then with the random string. The errors in classifying a string as random or not are summed.
So this is the code:
import collections
import random
import re
import string

import matplotlib.pyplot as plt
import numpy as np

# Placeholder: fill with the word list from gist.github.com/jbergantine/2390284
nouns = ['here is a list from gist.github.com/jbergantine/2390284']
allch = "abcdefghijklmnopqrstuvwxyz"
alb = 'etaoinshrdlcumwfgypbvkjxqz'  # letters from most to least frequent in English

def frqlist(s):
    # Relative frequency of each letter of alb in s.
    dic = collections.Counter(s)
    arr = np.zeros(len(alb))
    for key in dic:
        idx = alb.index(key)
        arr[idx] = float(dic[key]) / float(len(s))
    return arr

def generate_strs(nw=1):
    # A meaningful string built from nw dictionary words, plus a
    # random string of the same length.
    strok = ''.join([nouns[random.randrange(0, len(nouns))]
                     for i in range(nw)])
    rand_str = lambda n: ''.join([random.choice(string.ascii_lowercase)
                                  for i in range(n)])
    strrnd = rand_str(len(strok))
    return strok, strrnd

def is_rnd1(s):
    # Random unless 'e' (alb[0]) is frequent enough and 'z' (alb[-1]) rare enough.
    fq = frqlist(s)
    return not (fq[0] > 0.07 and fq[-1] < 0.01)

def is_rnd2(s):
    # Random if the string contains four consecutive consonants.
    return bool(re.search(r'[^aeiou]{4}', s))

maxwords = 12
nprobe = 1000

is_rnd = is_rnd1
nwa = []
err = []
print("Words\t% of errors")
for nw in range(1, maxwords):
    errok = 0    # meaningful strings classified as random
    errrnd = 0   # random strings classified as meaningful
    for i in range(0, nprobe):
        strok, strrnd = generate_strs(nw)
        if is_rnd(strok):
            errok += 1. / nprobe
        if not is_rnd(strrnd):
            errrnd += 1. / nprobe
    print(nw, "\t", (errok * 100. + errrnd * 100.) / 2.)
    nwa.append(nw)
    err.append((errok * 100. + errrnd * 100.) / 2.)

plt.plot(nwa, err)
plt.show()
Below are some results.
For function is_rnd1
Words % of errors
1 28.2
2 20.45
3 17.15
4 13.15
5 13.7
6 10.65
7 9.25
8 7.35
9 6.5
For function is_rnd2 (4 consecutive consonants)
Words % of errors
1 23.15
2 13.0
3 13.55
4 17.75
5 22.2
6 24.35
7 27.7
8 30.6
9 33.25
For function is_rnd2 modified to require six consecutive consonants ([^aeiou]{6})
Words % of errors
1 39.45
2 20.8
3 11.9
4 6.5
5 4.05
6 3.05
7 2.5
8 1.6
9 2.0
For me the results were interesting.
UPDATE:
I have tried machine learning. I used a single neuron with 26 inputs and one output. The inputs are the character frequencies of the string; the output is 1 if the string is random and 0 otherwise. The neuron is described by the following class:
class Neuron(object):
    def __init__(self, nin, wt=0.):
        self.nin = nin                               # number of inputs
        self.w = np.full(nin, wt, dtype=np.float32)  # weights
        self.out = 0.
        self.learnspd = 0.01                         # learning rate

    def result(self, ins):
        # Threshold activation over the weighted sum of the inputs.
        self.out = np.sum(self.w * ins)
        self.out = 1. if self.out > 0.1 else 0.
        return self.out

    def correctw(self, ins, err):
        # Perceptron-style weight update.
        self.w = self.w + err * self.learnspd * ins
After defining the neuron with neuron = Neuron(len(alb)), its training procedure is implemented:
def learning(neuron, nset, maxwords):
    for i in range(nset):
        nw = np.random.randint(1, maxwords + 1)
        strok, strrnd = generate_strs(nw)
        # Push the output toward 0 for the meaningful string...
        fq = frqlist(strok)
        neurres = neuron.result(fq)
        errok = 0.0 - neurres
        neuron.correctw(fq, errok)
        # ...and toward 1 for the random string.
        fq = frqlist(strrnd)
        neurres = neuron.result(fq)
        errrnd = 1.0 - neurres
        neuron.correctw(fq, errrnd)
Training is run by calling learning(neuron, nset, maxwords).
Finally, the neuron can be used:
def is_rnd_neuron(s, neuron):
    fq = frqlist(s)
    return bool(neuron.result(fq))
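For example, a minimal usage sketch (the sample strings are placeholders, inputs must contain only lowercase letters because frqlist indexes into alb, and the outputs depend on how training went):
neuron = Neuron(len(alb))
learning(neuron, nset=1000, maxwords=12)
print(is_rnd_neuron('nsapplicationdidchangescreenparameters', neuron))  # expected: False
print(is_rnd_neuron('ugxhemewzxnawdazgyoibdsuwrvnuqvytsvbo', neuron))   # expected: True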
Using the same testing procedure described above, I obtained the following results:
nset = 100
Words % of errors
1 50.0
2 50.0
3 50.0
4 50.0
5 50.0
6 50.0
7 50.0
8 50.0
9 50.0
nset = 500
Words % of errors
1 20.4
2 13.25
3 9.5
4 5.55
5 5.95
6 3.9
7 3.35
8 2.55
9 2.4
nset = 1000
Words % of errors
1 16.95
2 9.35
3 4.55
4 2.4
5 1.7
6 0.65
7 0.4
8 0.15
9 0.1
nset = 5000
Words % of errors
1 16.6
2 7.25
3 3.8
4 1.8
5 1.1
6 0.5
7 0.1
8 0.1
9 0.05
I was really impressed by how easily this can be implemented and how good the results it produces are.
Upvotes: 2
Reputation: 538
You could take a look at rrenaud's gibberish detector, which relies on Markov chains. You would need to alter the given code to suit your needs, since you would be looking for gibberish that contains non-gibberish, but it may help.
It is probably a great place to start.
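To give a flavor of the Markov-chain idea, below is a minimal sketch of my own (not rrenaud's actual code): score a string by the average log-probability of its character bigrams, learned from any decent sample of English text. The corpus file name and the threshold are placeholders you would tune on your own data.
import math
from collections import defaultdict

def train_bigrams(corpus):
    # Count character bigrams, then convert each row of counts
    # to log-probabilities.
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: math.log(c / total) for b, c in row.items()}
    return probs

def avg_log_prob(s, probs, floor=math.log(1e-6)):
    # Average bigram log-probability of s; unseen bigrams get a floor
    # value.  Low averages suggest gibberish.
    pairs = list(zip(s, s[1:]))
    if not pairs:
        return floor
    return sum(probs.get(a, {}).get(b, floor) for a, b in pairs) / len(pairs)

# Hypothetical usage:
# probs = train_bigrams(open('english_sample.txt').read().lower())
# if avg_log_prob(identifier.lower(), probs) < -6.0:  # threshold to tune
#     print('probably gibberish')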
But... I had some fun trying to solve this problem myself first. This may not be the best answer, and it relies entirely on each word starting with a capital letter (based on your given test input). But I enjoyed playing around with ideas to get something that works well, though it would be extremely slow on much longer text (the kind you would probably be looking at).
import re

import enchant  # spellchecker (the pyenchant package)

endict = enchant.Dict("en_US")

# Test input: your two example strings concatenated.
instr = 'UGxhemEgZGUgaWZXNaWdhZGyOiBDSUWRVNUQVYtSVBOIFVuaWQNSApplicationDidChangeScreenParametersNotification'
outstr = ''

# Any one- or two-letter words you can think of that are OK.
acceptable_short_words = ['a', 'I', 'if', 'is']

# Split the words based on capitals.
words = re.findall('[A-Z][^A-Z]*', instr)

def isWord(wordin):
    if wordin in acceptable_short_words:
        return True
    elif len(wordin) < 3:
        return False
    return endict.check(wordin)

for w in words:
    if isWord(w):
        outstr += " " + w

print(outstr.strip())
Running this script results in:
I Application Did Change Screen Parameters Notification
At first I tried searching the string for words. This had poor results, as it detected words within words (for example, Notification also contains the words 'Not', 'if', 'cat', 'at', etc.). So I decided to split on the capitals and check each element against the dictionary. This also doesn't work well, as many of the single-letter words proved to be words in English:
U - adjective|British|informal: (of language or social behaviour) characteristic of or appropriate to the upper social classes. ("U manners")
Who knew?
So finally, I decided to ignore any short word I didn't know (quite a few, it turns out) and restrict the check to common short words. I could go further and check word-pair frequencies using NLTK or a similar tool, much like the gibberish detector does. But that's been done so many times that I'd rather not.
Upvotes: 0