tobast

Reputation: 101

Find a distance measure of graphical similarity of two strings

I have had no luck finding a package for this, ideally in Python. Is there a library that can compare two strings by how similar they look graphically?

This would help, for instance, in fighting spam where someone obfuscates a string by using я instead of R or, worse, something like Α (Greek capital alpha, U+0391) instead of A.

The interface to such a package could be something like

distance("Foo", "Bar")  # large distance
distance("Αяe", "Are")  # small distance

Thanks!

Upvotes: 5

Views: 890

Answers (2)

Graipher

Reputation: 7186

With the information @Richard supplied in his answer, I came up with this short Python 3 script that implements UTS#39:

"""Implement the simple algorithm laid out in UTS#39, paragraph 4
"""

import csv
import re
import unicodedata

comment_pattern = re.compile(r'\s*#.*$')


def skip_comments(lines):
    """
    Strip comments from each line and yield the remaining non-empty lines.

    :param lines: any iterable of strings, such as a file object, list,
        tuple, or generator
    """

    for line in lines:
        line = comment_pattern.sub('', line).strip()
        if line:
            yield line


def normalize(s):
    return unicodedata.normalize("NFD", s)


def to_unicode(code_point):
    # int() accepts a bare hex string and ignores surrounding whitespace
    return chr(int(code_point, 16))


def read_table(file_name):
    """Map each source character to its prototype string."""
    d = {}
    # utf-8-sig drops a leading byte order mark, if the file has one
    with open(file_name, encoding="utf-8-sig") as f:
        reader = csv.reader(skip_comments(f), delimiter=";")
        for row in reader:
            source = to_unicode(row[0])
            prototypes = map(to_unicode, row[1].strip().split())
            d[source] = ''.join(prototypes)
    return d


TABLE = read_table("confusables.txt")


def skeleton(s):
    s = normalize(s)
    s = ''.join(TABLE.get(c, c) for c in s)
    return normalize(s)


def confusable(s1, s2):
    return skeleton(s1) == skeleton(s2)


if __name__ == "__main__":
    # The last pair compares FULLWIDTH LATIN SMALL LETTER J (U+FF4A) with a plain "j"
    for strings in [("Foo", "Bar"), ("Αяe", "Are"), ("ｊ", "j")]:
        print(*strings)
        print("Equal:", strings[0] == strings[1])
        print("Confusable:", confusable(*strings), "\n")

It assumes that the file confusables.txt is in the directory the script is run from. In addition, with a plain open() I had to delete the first character of that file, a non-printable symbol (most likely a UTF-8 byte order mark).
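If that symbol is indeed a byte order mark, Python's "utf-8-sig" codec strips it on read, so the file would not need editing. A minimal sketch, using a throwaway file as an assumed stand-in for confusables.txt:

```python
import os
import tempfile

# Write a small file that starts with a UTF-8 byte order mark (EF BB BF).
# This is only an assumed stand-in for the real confusables.txt.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf# comment line\n0391 ;\t0041 ;\tMA\n")
    path = tmp.name

with open(path, encoding="utf-8") as f:
    plain_first = f.readline()  # the mark survives as U+FEFF

with open(path, encoding="utf-8-sig") as f:
    sig_first = f.readline()    # the mark is stripped

os.remove(path)

print(repr(plain_first[0]))
print(repr(sig_first[0]))
```

With plain "utf-8" the first line starts with U+FEFF, which survives comment stripping and ends up being parsed as a code point; with "utf-8-sig" the first line starts with "#" as expected.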

It only follows the simple algorithm laid out at the beginning of paragraph 4, not the more complicated cases of whole- and mixed-script confusables laid out in 4.1 and 4.2. That is left as an exercise to the reader.
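As a rough starting point for the mixed-script case, here is a heuristic sketch. It is not the UTS#39 algorithm: Python's unicodedata module does not expose the Script property, so the helper names below (rough_scripts, looks_mixed_script) are my own and guess the script from the first word of each letter's Unicode name. That guess fails for names such as "FULLWIDTH LATIN ..." or "MATHEMATICAL BOLD ...".

```python
import unicodedata


def rough_scripts(s):
    """Guess the scripts used in s from the first word of each letter's
    Unicode name. A heuristic, not the real Script property."""
    scripts = set()
    for ch in s:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def looks_mixed_script(s):
    # More than one guessed script in a single string is suspicious
    return len(rough_scripts(s)) > 1


print(rough_scripts("\u0391\u044fe"))  # Greek alpha, Cyrillic ya, Latin e
print(looks_mixed_script("Are"))
```

A string flagged this way is not necessarily spam (legitimate text mixes scripts too), so this would be one signal among several.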

Note that "я" and "R" are not considered confusable by the Unicode Consortium, so the script reports False for "Αяe" and "Are".
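The question asked for a distance rather than a yes/no answer. One possible approach, sketched here under assumptions, is to take the edit distance between the two skeletons and to add your own look-alike pairs to the table. DEMO_TABLE is a tiny hand-written stand-in for the table built from confusables.txt, and its я → R entry is a custom addition, not Unicode data.

```python
import unicodedata

# Hand-written table for illustration only; in practice this would be the
# dictionary built from confusables.txt, plus any custom pairs.
DEMO_TABLE = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA -> A
    "\u044f": "R",  # CYRILLIC SMALL LETTER YA -> R (custom, not in Unicode's data)
}


def skeleton(s, table):
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(c, c) for c in s)
    return unicodedata.normalize("NFD", s)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def distance(s1, s2, table=DEMO_TABLE):
    return levenshtein(skeleton(s1, table), skeleton(s2, table))


print(distance("Foo", "Bar"))            # 3
print(distance("\u0391\u044fe", "Are"))  # 1
```

The second pair still differs by one because я maps to an uppercase R; comparing case-folded skeletons would be one way to ignore that, at the cost of treating "R" and "r" as identical.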

Upvotes: 1

Richard

Reputation: 61389

I'm not aware of a package that does this. However, you may be able to use tools like the homoglyph attack generator, the Unicode Consortium's confusables, references from Wikipedia's page on the IDN homograph attack, or other such resources to build your own library of look-alikes and compute a score based on that.

EDIT: It looks as though the Unicode folks have compiled a great, big database of characters that look alike. It's available here. If I were you, I'd write a script to read this into a Python dictionary and then scan your string for matches. An excerpt is:

FF4A ;  006A ;  MA  # ( ｊ → j ) FULLWIDTH LATIN SMALL LETTER J → LATIN SMALL LETTER J # →ϳ→
2149 ;  006A ;  MA  # ( ⅉ → j ) DOUBLE-STRUCK ITALIC SMALL J → LATIN SMALL LETTER J # 
1D423 ; 006A ;  MA  # ( 𝐣 → j ) MATHEMATICAL BOLD SMALL J → LATIN SMALL LETTER J  # 
1D457 ; 006A ;  MA  # ( 𝑗 → j ) MATHEMATICAL ITALIC SMALL J → LATIN SMALL LETTER J  # 
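A minimal sketch of that idea, using only the four excerpt lines (comments trimmed) rather than the full file. The function names parse_confusables and find_lookalikes are my own, chosen for illustration.

```python
# The four excerpt lines, with the trailing comments left out for brevity.
EXCERPT = """\
FF4A ;\t006A ;\tMA
2149 ;\t006A ;\tMA
1D423 ;\t006A ;\tMA
1D457 ;\t006A ;\tMA
"""


def parse_confusables(text):
    """Build {source character: prototype string} from confusables-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        source, target = [field.strip() for field in line.split(";")[:2]]
        table[chr(int(source, 16))] = "".join(
            chr(int(cp, 16)) for cp in target.split()
        )
    return table


def find_lookalikes(s, table):
    """Return (index, character, prototype) for every character found in the table."""
    return [(i, c, table[c]) for i, c in enumerate(s) if c in table]


table = parse_confusables(EXCERPT)
print(find_lookalikes("\uff4aava", table))  # fullwidth j followed by "ava"
```

The number of matches, perhaps relative to the string length, could then feed into whatever score you build.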

Upvotes: 5
