Python library to automatically detect unknown similar strings

I have a very big file with millions of paths to various executables on windows systems. A simple example would be the following:

C:\windows\ccmcache\1d\Deploy-Application.exe
C:\WINDOWS\ccmcache\7\Deploy-Application.exe
C:\windows\ccmcache\2o\Deploy-Application.exe
C:\WINDOWS\ccmcache\6\Deploy-Application.exe
C:\WINDOWS\ccmcache\15\Deploy-Application.exe
C:\WINDOWS\ccmcache\m\Deploy-Application.exe
C:\WINDOWS\ccmcache\1g\Deploy-Application.exe
C:\windows\ccmcache\2r\Deploy-Application.exe
C:\windows\ccmcache\1l\Deploy-Application.exe
C:\windows\ccmcache\2s\Deploy-Application.exe

C:\Users\user23452345\temp\test\1\Another1-Application.exe
C:\Users\user1324asdf\temp\Another-Applicatiooon.exe
C:\Users\user23452---5\temp\lili\Another-Application.exe
C:\Users\user23hkjhf_5\temp\An0ther-Application.exe

As a human, I can recognize that these strings are similar and match them fairly easily with a regex. My problem is finding these patterns in the first place: there are far too many of them, they are completely unknown to me, and they change frequently.

My goal is to write a Python script that finds these similar strings with some degree of certainty and groups them for me.

Which methods, libraries, keywords etc. should I look into to solve this problem?

Upvotes: 2

Answers (3)

Jonas Byström

Reputation: 26169

One fairly straightforward way is to measure how much a pair of strings differs, like so:

import difflib
from collections import defaultdict


grouping_requirement = 0.75  # in (0, 1); the closer to 1, the more similar two strings must be to be grouped

s = r'''C:\windows\ccmcache\1d\Deploy-Application.exe
C:\WINDOWS\ccmcache\7\Deploy-Application.exe
C:\windows\ccmcache\2o\Deploy-Application.exe
C:\WINDOWS\ccmcache\6\Deploy-Application.exe
C:\WINDOWS\ccmcache\15\Deploy-Application.exe
C:\WINDOWS\ccmcache\m\Deploy-Application.exe
C:\WINDOWS\ccmcache\1g\Deploy-Application.exe
C:\windows\ccmcache\2r\Deploy-Application.exe
C:\windows\ccmcache\1l\Deploy-Application.exe
C:\windows\ccmcache\2s\Deploy-Application.exe
C:\Users\user23452345\temp\test\1\Another1-Application.exe
C:\Users\user1324asdf\temp\Another-Applicatiooon.exe
C:\Users\user23452---5\temp\lili\Another-Application.exe
C:\Users\user23hkjhf_5\temp\An0ther-Application.exe'''

groups = defaultdict(list)


def match_ratio(s1,s2):
    return difflib.SequenceMatcher(None,s1,s2).ratio()


for line in set(s.splitlines()):
    for group in groups:
        if match_ratio(group, line) > grouping_requirement:
            groups[group].append(line)
            break
    else:
        groups[line].append(line)

for group in groups.values():
    print(', '.join(group))
    print()

The output of this little application is shown below (the order can vary between runs, because the lines are iterated as a set):

C:\WINDOWS\ccmcache\1g\Deploy-Application.exe, C:\WINDOWS\ccmcache\m\Deploy-Application.exe, C:\windows\ccmcache\1l\Deploy-Application.exe, C:\WINDOWS\ccmcache\15\Deploy-Application.exe, C:\WINDOWS\ccmcache\7\Deploy-Application.exe, C:\WINDOWS\ccmcache\6\Deploy-Application.exe, C:\windows\ccmcache\2s\Deploy-Application.exe, C:\windows\ccmcache\1d\Deploy-Application.exe, C:\windows\ccmcache\2o\Deploy-Application.exe, C:\windows\ccmcache\2r\Deploy-Application.exe

C:\Users\user23452345\temp\test\1\Another1-Application.exe, C:\Users\user23hkjhf_5\temp\An0ther-Application.exe, C:\Users\user1324asdf\temp\Another-Applicatiooon.exe, C:\Users\user23452---5\temp\lili\Another-Application.exe

The constant grouping_requirement at the top of the snippet is set arbitrarily to 0.75. Lowering it toward 0.0 groups more paths together; raising it toward 1.0 groups fewer. Good luck!
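To see the effect of the threshold, the loop can be wrapped in a function and called with several values. This is only a minimal sketch: the name group_similar and the four-line sample are made up for illustration, and sorted() is added purely so that repeated runs give the same grouping.

```python
import difflib
from collections import defaultdict


def match_ratio(s1, s2):
    # Similarity in [0, 1]; 1.0 means the strings are identical.
    return difflib.SequenceMatcher(None, s1, s2).ratio()


def group_similar(lines, threshold):
    """Greedily group lines; the first line of each group acts as its key."""
    groups = defaultdict(list)
    for line in sorted(set(lines)):  # sorted() makes the result reproducible
        for key in groups:
            if match_ratio(key, line) > threshold:
                groups[key].append(line)
                break
        else:
            # No existing group was similar enough: start a new one.
            groups[line].append(line)
    return list(groups.values())


# Illustrative sample taken from the paths in the question.
sample = [
    r'C:\windows\ccmcache\1d\Deploy-Application.exe',
    r'C:\WINDOWS\ccmcache\7\Deploy-Application.exe',
    r'C:\Users\user1324asdf\temp\Another-Applicatiooon.exe',
    r'C:\Users\user23hkjhf_5\temp\An0ther-Application.exe',
]

for threshold in (0.0, 0.75, 1.0):
    print(threshold, len(group_similar(sample, threshold)), 'group(s)')
```

For this sample, a threshold of 0.0 puts everything into one group, and 1.0 gives every distinct string its own group, because the ratio never exceeds 1.0. Each line is compared against every existing group key, so with millions of lines and many groups this gets slow.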

Upvotes: 0

Hooked

Reputation: 88168

Try fuzzywuzzy, a fuzzy string matcher. It makes a difference whether you keep the strings as they are or lowercase them first:

from fuzzywuzzy import fuzz
import itertools

# Raw strings keep the backslashes literal (otherwise '\1' is read as an escape).
lines = [
    r'C:\windows\ccmcache\1d\Deploy-Application.exe',
    r'C:\WINDOWS\ccmcache\m\Deploy-Application.exe',
    r'user5323\A-different-Application.bat',
]

for line1, line2 in itertools.combinations(lines, r=2):

    case_match = fuzz.ratio(line1, line2)
    insensitive_case_match = fuzz.ratio(line1.lower(), line2.lower())

    print(line1[:10], '...', line1[:-10])
    print(line2[:10], '...', line2[:-10])
    print(case_match, insensitive_case_match)
    print()

C:\windows ... C:\windows\ccmcache\1d\Deploy-Appli
C:\WINDOWS ... C:\WINDOWS\ccmcache\m\Deploy-Appli
81 97

C:\windows ... C:\windows\ccmcache\1d\Deploy-Appli
user5323\A ... user5323\A-different-Appli
44 44

C:\WINDOWS ... C:\WINDOWS\ccmcache\m\Deploy-Appli
user5323\A ... user5323\A-different-Appli
40 45
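If installing fuzzywuzzy is not an option, the same comparison can be approximated with the standard library. The sketch below swaps in difflib.SequenceMatcher for fuzz.ratio and scales the result to 0-100. The helper name score is made up for illustration, and its numbers are not guaranteed to equal fuzzywuzzy's.

```python
import difflib


def score(a, b):
    # Hypothetical helper: SequenceMatcher ratio scaled to 0-100 and rounded.
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())


path1 = r'C:\windows\ccmcache\1d\Deploy-Application.exe'
path2 = r'C:\WINDOWS\ccmcache\m\Deploy-Application.exe'

case_sensitive = score(path1, path2)
case_insensitive = score(path1.lower(), path2.lower())

# Windows usually treats paths case-insensitively, so lowercasing first
# removes differences such as "windows" vs "WINDOWS" that carry no meaning.
print(case_sensitive, case_insensitive)
```

The case-insensitive score comes out higher here, which is the same effect the fuzzywuzzy example shows.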

Upvotes: 1

89f3a1c

Reputation: 1488

One possible approach is to calculate the distance between strings. For that, you could use the textdistance library.
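To make "distance between strings" concrete, here is a minimal pure-Python sketch of one common measure, the Levenshtein edit distance. It does not use the textdistance API; the function names levenshtein and normalized_similarity are chosen for illustration only.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    # 1.0 for identical strings; lower values mean more edits are needed.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein('kitten', 'sitting'))  # → 3
print(normalized_similarity(
    r'C:\WINDOWS\ccmcache\7\Deploy-Application.exe',
    r'C:\WINDOWS\ccmcache\6\Deploy-Application.exe',
))
```

Two paths that differ only in the cache folder name need very few edits, so their normalized similarity is close to 1.0 and can be compared against a grouping threshold.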

Hope this helps!

Edit:

Two starting points to get more familiarized with the subject:

Upvotes: 2
