Reputation:
I have a very large file with millions of paths to various executables on Windows systems. A simple example would be the following:
or
As a human, I can recognize that these strings are similar and match them fairly easily with some regex in code. My problem, however, is finding these patterns in the first place: there are far too many of them, they are completely unknown to me, and they change frequently.
My goal is to write a Python script that finds these similar strings with a degree of certainty and groups them for me.
Which methods, libraries, keywords etc. should I look into to solve this problem?
Upvotes: 2
Views: 101
Reputation: 26169
One fairly straightforward way is to check how much a pair of strings differ, like so:
import difflib
from collections import defaultdict
grouping_requirement = 0.75 # (0;1), the closer to 1, the stronger the equality needs to be to be grouped
s = r'''C:\windows\ccmcache\1d\Deploy-Application.exe
C:\WINDOWS\ccmcache\7\Deploy-Application.exe
C:\windows\ccmcache\2o\Deploy-Application.exe
C:\WINDOWS\ccmcache\6\Deploy-Application.exe
C:\WINDOWS\ccmcache\15\Deploy-Application.exe
C:\WINDOWS\ccmcache\m\Deploy-Application.exe
C:\WINDOWS\ccmcache\1g\Deploy-Application.exe
C:\windows\ccmcache\2r\Deploy-Application.exe
C:\windows\ccmcache\1l\Deploy-Application.exe
C:\windows\ccmcache\2s\Deploy-Application.exe
C:\Users\user23452345\temp\test\1\Another1-Application.exe
C:\Users\user1324asdf\temp\Another-Applicatiooon.exe
C:\Users\user23452---5\temp\lili\Another-Application.exe
C:\Users\user23hkjhf_5\temp\An0ther-Application.exe'''
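# Maps a representative path (the first one seen) to all paths grouped with it
groups = defaultdict(list)

def match_ratio(s1, s2):
    # Similarity in [0, 1]; 1.0 means the strings are identical
    return difflib.SequenceMatcher(None, s1, s2).ratio()

for line in set(s.splitlines()):
    for group in groups:
        if match_ratio(group, line) > grouping_requirement:
            groups[group].append(line)
            break
    else:
        # No existing group was similar enough, so start a new one
        groups[line].append(line)

for group in groups.values():
    print(', '.join(group))
    print()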
The output of this little application is:
C:\WINDOWS\ccmcache\1g\Deploy-Application.exe, C:\WINDOWS\ccmcache\m\Deploy-Application.exe, C:\windows\ccmcache\1l\Deploy-Application.exe, C:\WINDOWS\ccmcache\15\Deploy-Application.exe, C:\WINDOWS\ccmcache\7\Deploy-Application.exe, C:\WINDOWS\ccmcache\6\Deploy-Application.exe, C:\windows\ccmcache\2s\Deploy-Application.exe, C:\windows\ccmcache\1d\Deploy-Application.exe, C:\windows\ccmcache\2o\Deploy-Application.exe, C:\windows\ccmcache\2r\Deploy-Application.exe
C:\Users\user23452345\temp\test\1\Another1-Application.exe, C:\Users\user23hkjhf_5\temp\An0ther-Application.exe, C:\Users\user1324asdf\temp\Another-Applicatiooon.exe, C:\Users\user23452---5\temp\lili\Another-Application.exe
At the top of the code snippet there is a constant, grouping_requirement, which I arbitrarily set to 0.75. If you reduce that value toward 0.0, more paths will be grouped together; if you raise it toward 1.0, fewer paths will be grouped. Good luck!
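To get a feel for the threshold, it can help to run the same greedy grouping at several values and compare the number of groups. The following is only a sketch: the function name group_paths and the four sample paths are illustrative, not part of the code above.

```python
import difflib

def group_paths(paths, threshold):
    """Greedy grouping: each path joins the first group whose key is similar enough."""
    groups = {}  # representative path -> list of member paths
    for path in paths:
        for key in groups:
            if difflib.SequenceMatcher(None, key, path).ratio() > threshold:
                groups[key].append(path)
                break
        else:
            groups[path] = [path]  # nothing matched: start a new group
    return groups

# Small illustrative sample (raw strings keep the backslashes literal)
paths = [
    r"C:\windows\ccmcache\1d\Deploy-Application.exe",
    r"C:\WINDOWS\ccmcache\7\Deploy-Application.exe",
    r"C:\Users\user1324asdf\temp\Another-Applicatiooon.exe",
    r"C:\Users\user23hkjhf_5\temp\An0ther-Application.exe",
]

for threshold in (0.0, 0.75, 0.99):
    print(threshold, len(group_paths(paths, threshold)))
```

For this sample, a threshold of 0.0 puts everything into one group and 0.99 leaves every path in its own group; useful values lie somewhere in between and depend on your data.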
Upvotes: 0
Reputation: 88168
Try fuzzywuzzy, a fuzzy ("soft") string matcher. It makes a difference whether you keep the strings as they are or lowercase them first:
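from fuzzywuzzy import fuzz
import itertools

# Note: these are not raw strings, so '\1' in the first path is read as an
# octal escape (a single control character), not a backslash followed by '1'.
# Prefix real paths with r'...' to keep the backslashes literal.
lines = [
    'C:\windows\ccmcache\1d\Deploy-Application.exe',
    'C:\WINDOWS\ccmcache\m\Deploy-Application.exe',
    'user5323\A-different-Application.bat',
]

for line1, line2 in itertools.combinations(lines, r=2):
    case_match = fuzz.ratio(line1, line2)
    insensitive_case_match = fuzz.ratio(line1.lower(), line2.lower())
    # Show the first 10 characters, then everything except the last 10
    print(line1[:10], '...', line1[:-10])
    print(line2[:10], '...', line2[:-10])
    print(case_match, insensitive_case_match)
    print()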
This prints:
C:\windows ... C:\windows\ccmcached\Deploy-Appli
C:\WINDOWS ... C:\WINDOWS\ccmcache\m\Deploy-Appli
80 95
C:\windows ... C:\windows\ccmcached\Deploy-Appli
user5323\A ... user5323\A-different-Appli
42 45
C:\WINDOWS ... C:\WINDOWS\ccmcache\m\Deploy-Appli
user5323\A ... user5323\A-different-Appli
40 45
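The effect of lowercasing can also be checked with the standard library alone. This sketch swaps in difflib.SequenceMatcher for fuzz.ratio so it runs without extra installs; the scores are not guaranteed to match fuzzywuzzy's. The helper name similarity is an assumption made for illustration.

```python
import difflib

def similarity(a, b, ignore_case=False):
    """Return a 0-100 similarity score; optionally lowercase both inputs first."""
    if ignore_case:
        a, b = a.lower(), b.lower()
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

# Raw strings, so every backslash is kept as a literal character
p1 = r"C:\windows\ccmcache\1d\Deploy-Application.exe"
p2 = r"C:\WINDOWS\ccmcache\m\Deploy-Application.exe"

print(similarity(p1, p2), similarity(p1, p2, ignore_case=True))
```

Since Windows paths are case-insensitive, lowercasing before comparison is usually the safer default for this kind of data.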
Upvotes: 1
Reputation: 1488
One possible approach is to calculate the distance between strings. For that, you could use the textdistance library.
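To illustrate what a string distance is, here is a hand-written Levenshtein (edit) distance in plain Python. It is a sketch of just one of many possible measures and does not use textdistance itself; the function names are made up for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def normalized_similarity(a, b):
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

p1 = r"c:\windows\ccmcache\1d\deploy-application.exe"
p2 = r"c:\windows\ccmcache\m\deploy-application.exe"
print(levenshtein(p1, p2), normalized_similarity(p1, p2))
```

A normalized score like this can be compared against a threshold to decide whether two paths belong to the same group.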
Hope this helps!
Edit:
Two starting points to get more familiar with the subject:
Upvotes: 2