regex works on pythex but not python2.7, finding unicode representations by regex

Question

I am having a strange regex issue: my regex works on pythex, but not in Python itself. I am using Python 2.7 right now. I want to remove all unicode instances like \x92, of which there are many (for example, 'Thomas Bradley \x93Brad\x94 Garza'):

import re, requests

def purify(string):
    strange_issue = r"""\tGfacebook.com/KilledByPolice/posts/625590984135709	])|(?:target[=]new[>])(?P[A-Z].*?)[,]"

urls = {2013: "http://www.killedbypolice.net/kbp2013.html",
        2014: "http://www.killedbypolice.net/kbp2014.html",
        2015: "http://www.killedbypolice.net/" }

names_of_the_dead = []

for url in urls.values():
    response = requests.get(url)
    content = response.content
    people_killed_by_police_that_year_alone = re.findall(name_rgx, content)
    for dead_person in people_killed_by_police_that_year_alone:
        names_of_the_dead.append(purify(dead_person))

dead_americans_as_string = ", ".join(names_of_the_dead)
print("RIP, {} since 2013:\n".format(len(names_of_the_dead))) # 3085! :)
print(dead_americans_as_string)



In [95]: unicode_chars_rgx = r"[\][x]\d+"

In [96]: testcase = "Myron De\x92Shawn May"

In [97]: x = purify(testcase)

In [98]: x
Out[98]: 'Myron De\x92Shawn May'

In [103]: match = re.match(unicode_chars_rgx, testcase)

In [104]: match

How can I get these \x00 characters out? Thank you

Ignacio Vazquez-Abrams · Accepted Answer

Certainly not by trying to find things that look like "\x00".
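
To illustrate why (a minimal sketch, not part of the original session): in a normal string literal, Python resolves \x92 into a single byte before any regex sees it, so there is no backslash, x, or digit left to match. Pythex presumably received the text with the backslash typed literally, which is a different four-character sequence. The name literal_escape_rgx is hypothetical, and byte literals are used so the snippet behaves the same on Python 2.7 and Python 3.

```python
import re

# b"..." keeps this a byte string on both Python 2.7 and Python 3.
testcase = b"Myron De\x92Shawn May"

# The escape \x92 is resolved by the parser into ONE byte (value 0x92).
assert len(b"De\x92") == 3

# Hypothetical pattern: a literal backslash, an "x", then two hex digits.
literal_escape_rgx = re.compile(br"\\x[0-9a-fA-F]{2}")

# Nothing to find in the real data: it holds the byte, not its escape.
print(literal_escape_rgx.search(testcase))  # None

# A raw literal really does contain backslash, x, 9, 2, so it matches.
print(literal_escape_rgx.search(br"Myron De\x92Shawn May") is not None)  # True
```

The repr shown at the prompt ('Myron De\x92Shawn May') is only how Python displays the byte, which is why the text looks like it should match.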

If you want to destroy the data:

>>> re.sub('[\x7f-\xff]', '', "Myron De\x92Shawn May")
'Myron DeShawn May'
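
The same substitution can be wrapped in a small helper for use inside the scraping loop. This is a sketch under assumptions: strip_high_bytes and HIGH_BYTES_RGX are hypothetical names, and the input is assumed to be the raw byte string that response.content returns.

```python
import re

# Matches every byte from 0x7f through 0xff, as in the one-liner above.
HIGH_BYTES_RGX = re.compile(b"[\x7f-\xff]")


def strip_high_bytes(raw_bytes):
    """Drop every byte from 0x7f up to 0xff; printable ASCII passes through."""
    return HIGH_BYTES_RGX.sub(b"", raw_bytes)


print(strip_high_bytes(b"Thomas Bradley \x93Brad\x94 Garza"))
```

Note that the quotation marks around the nickname disappear entirely, which is the sense in which this approach destroys data.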

More work, but tries to preserve the text as well as possible:

>>> import unidecode
>>> unidecode.unidecode("Myron De\x92Shawn May".decode('cp1251'))
"Myron De'Shawn May"

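If installing unidecode is not an option, a rough standard-library approximation is to decode the bytes and then map the few curly quotes by hand. This is only a sketch: it assumes the pages are encoded as Windows-1252 (the example above uses cp1251), it covers just four punctuation characters, and decode_and_simplify and PUNCTUATION_TO_ASCII are hypothetical names.

```python
# Code points of the curly quotes, mapped to plain ASCII equivalents.
PUNCTUATION_TO_ASCII = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}


def decode_and_simplify(raw_bytes, encoding="cp1252"):
    """Decode raw bytes to text, then replace curly quotes with ASCII ones."""
    text = raw_bytes.decode(encoding)  # the encoding is an assumption
    return text.translate(PUNCTUATION_TO_ASCII)


print(decode_and_simplify(b"Myron De\x92Shawn May"))
print(decode_and_simplify(b"Thomas Bradley \x93Brad\x94 Garza"))
```

Unlike stripping the bytes, this keeps the apostrophe and the quotes around the nickname. Characters outside the table are left as decoded, so unidecode remains the more thorough option.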