Reputation: 101
I have had no luck finding a package for this, ideally in Python. Is there a library that lets one compare two strings by how similar they look?
It would, for instance, help fight spam, where people use я instead of R or, worse, characters like Α (Greek capital alpha, U+0391) instead of A to obfuscate their strings.
The interface to such a package could be something like
distance("Foo", "Bar") # large distance
distance("Αяe", "Are") # small distance
Thanks!
Upvotes: 5
Views: 890
Reputation: 7186
With the information @Richard supplied in his answer, I came up with this short Python 3 script that implements UTS#39:
"""Implement the simple algorithm laid out in UTS#39, paragraph 4
"""
import csv
import re
import unicodedata

comment_pattern = re.compile(r'\s*#.*$')


def skip_comments(lines):
    """
    Strip the comments and yield the remaining non-empty lines.

    :param lines: any object which we can iterate through such as a file
        object, list, tuple, or generator
    """
    for line in lines:
        line = comment_pattern.sub('', line).strip()
        if line:
            yield line


def normalize(s):
    return unicodedata.normalize("NFD", s)


def to_unicode(code_point):
    # Convert a hexadecimal code point such as "0391" into its character.
    return chr(int(code_point, 16))


def read_table(file_name):
    d = {}
    # "utf-8-sig" decodes UTF-8 and drops a leading byte order mark, if any.
    with open(file_name, encoding="utf-8-sig") as f:
        reader = csv.reader(skip_comments(f), delimiter=";")
        for row in reader:
            source = to_unicode(row[0])
            prototypes = map(to_unicode, row[1].strip().split())
            d[source] = ''.join(prototypes)
    return d


TABLE = read_table("confusables.txt")


def skeleton(s):
    # Normalize, replace each character by its prototype, normalize again.
    s = normalize(s)
    s = ''.join(TABLE.get(c, c) for c in s)
    return normalize(s)


def confusable(s1, s2):
    return skeleton(s1) == skeleton(s2)


if __name__ == "__main__":
    # The last pair is fullwidth j (U+FF4A) versus ASCII j.
    for strings in [("Foo", "Bar"), ("Αяe", "Are"), ("ｊ", "j")]:
        print(*strings)
        print("Equal:", strings[0] == strings[1])
        print("Confusable:", confusable(*strings), "\n")
It assumes that the file confusables.txt is in the directory the script is being run from. In addition, the file starts with a weird, non-printable symbol, most likely a UTF-8 byte order mark. I deleted it by hand, though opening the file with encoding="utf-8-sig" skips it as well.
It only follows the simple algorithm laid out at the beginning of paragraph 4, not the more complicated cases of whole- and mixed-script confusables laid out in 4.1 and 4.2. That is left as an exercise to the reader.
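As a starting point for that exercise, here is a rough sketch of a mixed-script check. It is not the procedure from UTS#39 section 4.2: Python's unicodedata module does not expose the Script property, so this guesses a script label from the first word of each character's Unicode name. The helper names scripts_used and looks_mixed_script are made up for this illustration, and the heuristic mislabels some characters (a fullwidth letter, for example, is reported as "FULLWIDTH").

```python
import unicodedata


def scripts_used(s):
    """Guess script labels from the first word of each letter's Unicode name.

    This is only a heuristic, not the Unicode Script property.
    """
    labels = set()
    for c in s:
        if c.isalpha():  # ignore digits, spaces and punctuation
            name = unicodedata.name(c, "")
            if name:
                labels.add(name.split()[0])
    return labels


def looks_mixed_script(s):
    """Return True if the letters of s seem to come from more than one script."""
    return len(scripts_used(s)) > 1
```

With this, looks_mixed_script("Are") is False, while the obfuscated "Αяe" from the question is flagged because it mixes Greek, Cyrillic and Latin letters.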
Note that "я" and "R" are not considered confusable by the Unicode Consortium, so this will return False for those two strings.
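If a graded distance like the one asked for in the question is preferred over a yes/no answer, one possible sketch is to compare the skeletons with difflib instead of testing them for equality. This is my own assumption about how to turn the skeletons into a score, not something UTS#39 defines. To stay self-contained, the block uses a hypothetical two-entry SAMPLE_TABLE in place of the full table read from confusables.txt.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical two-entry sample; a real table would be loaded from confusables.txt.
SAMPLE_TABLE = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA -> LATIN CAPITAL LETTER A
    "\uff4a": "j",  # FULLWIDTH LATIN SMALL LETTER J -> LATIN SMALL LETTER J
}


def skeleton(s, table=SAMPLE_TABLE):
    """Normalize, map each character to its prototype, normalize again."""
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(c, c) for c in s)
    return unicodedata.normalize("NFD", s)


def distance(s1, s2, table=SAMPLE_TABLE):
    """Return a value in [0, 1]: 0.0 for equal skeletons, 1.0 for nothing in common."""
    matcher = SequenceMatcher(None, skeleton(s1, table), skeleton(s2, table))
    return 1.0 - matcher.ratio()
```

Here distance("Foo", "Bar") is 1.0, and distance("Αяe", "Are") comes out smaller because only the "я" still differs after the alpha has been mapped to "A".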
Upvotes: 1
Reputation: 61389
I'm not aware of a package that does this. However, you may be able to use tools like the homoglyph attack generator, the Unicode Consortium's confusables, references from Wikipedia's page on the IDN homograph attack, or other such resources to build your own library of look-alikes and compute a score based on that.
EDIT: It looks as though the Unicode folks have compiled a great big database of characters that look alike. It's available here. If I were you, I'd write a script to read this into a Python dictionary and then scan your string for matches. An excerpt is:
FF4A ; 006A ; MA # ( ｊ → j ) FULLWIDTH LATIN SMALL LETTER J → LATIN SMALL LETTER J # →ϳ→
2149 ; 006A ; MA # ( ⅉ → j ) DOUBLE-STRUCK ITALIC SMALL J → LATIN SMALL LETTER J #
1D423 ; 006A ; MA # ( 𝐣 → j ) MATHEMATICAL BOLD SMALL J → LATIN SMALL LETTER J #
1D457 ; 006A ; MA # ( 𝑗 → j ) MATHEMATICAL ITALIC SMALL J → LATIN SMALL LETTER J #
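The read-into-a-dictionary idea can be sketched roughly as follows. This is only a minimal illustration that parses the four excerpt lines from an in-memory string (with the comments shortened) rather than the real file. The names SAMPLE, parse_confusables and find_lookalikes are invented for this example.

```python
# The four excerpt lines, with the trailing comments shortened.
SAMPLE = """\
FF4A ; 006A ; MA # FULLWIDTH LATIN SMALL LETTER J
2149 ; 006A ; MA # DOUBLE-STRUCK ITALIC SMALL J
1D423 ; 006A ; MA # MATHEMATICAL BOLD SMALL J
1D457 ; 006A ; MA # MATHEMATICAL ITALIC SMALL J
"""


def parse_confusables(text):
    """Map each source character to the string of its prototype characters."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the comment part
        if not line:
            continue
        source, target = [field.strip() for field in line.split(";")[:2]]
        # The target field may hold several space-separated code points.
        table[chr(int(source, 16))] = "".join(
            chr(int(cp, 16)) for cp in target.split()
        )
    return table


def find_lookalikes(s, table):
    """List (index, character, prototype) for every character found in the table."""
    return [(i, c, table[c]) for i, c in enumerate(s) if c in table]
```

Calling find_lookalikes on a string then reports where look-alike characters occur and which plain character each one imitates.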
Upvotes: 5