deadlock

Reputation: 7310

How to account for accent characters for regex in Python?

I currently use re.findall to find and isolate words after the '#' character for hash tags in a string:

hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)

It searches str1 and finds all the hashtags. This works, but it doesn't account for accented characters such as áéíóúñü¿.

If one of these letters is in str1, the hashtag is saved only up to the letter before it. So for example, #yogenfrüz would be saved as #yogenfr.
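A minimal reproduction of the behaviour described above. The sample string is made up for illustration; only the hashtag #yogenfrüz comes from the question.

```python
import re

# Hypothetical sample input; only '#yogenfrüz' is taken from the question.
str1 = "frozen yogurt #yogenfrüz and #python_3"

# The ASCII-only character class stops at the first accented letter.
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
print(hashtags)  # ['yogenfr', 'python_3']
```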

I need to be able to account for all the accented letters used in German, Dutch, French and Spanish so that I can save hashtags like #yogenfrüz.

How can I go about doing this?

Upvotes: 35

Views: 32017

Answers (6)

Quetzal

Reputation: 1

This one (copied from zanga's answer) is not bad, except that you have to type À and ÿ, which is not very handy:

hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)

Here is the same pattern, but with the range written as Unicode escapes (À is U+00C0 and ÿ is U+00FF):

hashtags = re.findall(r'#([A-Za-z0-9_\u00C0-\u00FF]+)', str1)

which is much easier to type.

There was a solution with decimal range, but I do not remember it.
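I can't recover the original decimal version, but one possible way to write the range from decimal code points is sketched below. It assumes "decimal range" simply means 192 to 255 (À to ÿ), and the names LOW, HIGH and pattern are only illustrative.

```python
import re

# Assumption: the "decimal range" is 192..255, the code points of À and ÿ.
LOW, HIGH = 192, 255

# chr() turns each decimal code point into its character, so the class
# below is equivalent to [A-Za-z0-9_À-ÿ].
pattern = re.compile(f'#([A-Za-z0-9_{chr(LOW)}-{chr(HIGH)}]+)')

str1 = "#yogenfrüz"
hashtags = pattern.findall(str1)
print(hashtags)  # ['yogenfrüz']
```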

Upvotes: 0

zanga

Reputation: 719

I know this question is a little outdated, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex.

hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)

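which will return ['yogenfrüz'] for the input '#yogenfrüz' (the '#' sits outside the capturing group, so it is not part of the result).

Hope this helps someone else.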
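A quick check of what the À-ÿ range does and does not cover. The test strings and the tighter `strict` pattern are my own additions, not part of the original answer: the range also lets through × (U+00D7) and ÷ (U+00F7), and it misses letters outside Latin-1 such as the French œ (U+0153).

```python
import re

pattern = re.compile(r'#([A-Za-z0-9_\u00C0-\u00FF]+)')  # same as À-ÿ

print(pattern.findall('#yogenfrüz'))  # ['yogenfrüz']
# The multiplication and division signs sit inside the range.
print(pattern.findall('#2×3'))        # ['2×3']
# œ is U+0153, outside the range, so the match stops before it.
print(pattern.findall('#cœur'))       # ['c']

# Assumed tighter variant: same range with U+00D7 and U+00F7 cut out.
strict = re.compile(r'#([A-Za-z0-9_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+)')
print(strict.findall('#2×3'))         # ['2']
```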

Upvotes: 23

Andj

Reputation: 1374

Building on all the other answers:

The key problem is that the re module differs in significant ways from other regular expression engines. In theory, Unicode's definition of the \w metacharacter would do what the question requires, but the re module does not implement Unicode's definition of \w.

The easy solution is to swap the regular expression engine, using a solution that is more compatible. The easiest way is to install the regex module and use it. The code that some of the other answers have given will then work as the question needs.

import regex as re
# import unicodedata as ud
import unicodedataplus as ud
hashtags = re.findall(r'#(\w+)', ud.normalize("NFC",str1))

Or if you only want to focus on Latin script, including non-spacing marks (i.e. combining diacritics):

import regex as re
# import unicodedata as ud
import unicodedataplus as ud
hashtags = re.findall(r'#([\p{Latin}\p{Mn}]+)', ud.normalize("NFC",str1))

P.S. I have used unicodedataplus, which is a drop-in replacement for unicodedata. It has additional methods, and it is kept up to date with Unicode versions. With the unicodedata module, updating the Unicode version requires updating Python.
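To illustrate why the NFC normalization step matters, here is a small sketch that deliberately uses only the standard re and unicodedata modules rather than regex, so it runs without installing anything. The decomposed test string is my own example.

```python
import re
import unicodedata

# 'ü' written in decomposed form: 'u' followed by U+0308 COMBINING DIAERESIS.
decomposed = '#yogenfru\u0308z'

# The standard re module's \w does not match the combining mark,
# so the match stops right after the plain 'u'.
print(re.findall(r'#(\w+)', decomposed))  # ['yogenfru']

# NFC composes 'u' + U+0308 into the single letter 'ü' (U+00FC).
composed = unicodedata.normalize('NFC', decomposed)
print(re.findall(r'#(\w+)', composed))    # ['yogenfrüz']
```

Normalizing first helps here because ü has a precomposed form; a letter with no precomposed form would still be split at the combining mark by the standard re module.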

Upvotes: 2

Shabbir Khan

Reputation: 189

Here's an update to Ibrahim Najjar's original answer based on the comment Martijn Pieters made to the answer and another answer Martijn Pieters gave in https://stackoverflow.com/a/16467505/5302861:

import re
import unicodedata

s = "#ábá123"
n = unicodedata.normalize('NFC', s)

print(n)
c = ''.join(re.findall(r'#\w+', n, re.UNICODE))
print(s, len(s), c, len(c))

Upvotes: 0

Berk

Reputation: 358

You may also want to use

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a? Assume you have loaded your Unicode string into a variable called my_unicode. Normalizing à into a is this simple:

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

Explicit example:

>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'

Check this answer, which helped me a lot: How to convert unicode accented characters to pure ascii without accents?
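The session above looks like Python 2 output. A rough Python 3 equivalent might look like the following; strip_accents is a hypothetical helper name of mine, and the decode step is needed because encode returns bytes in Python 3.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: drop accents by decomposing, then discarding non-ASCII."""
    # NFD splits 'à' into 'a' plus a combining grave accent.
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding with 'ignore' drops the combining marks; decode back to str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('àà'))          # aa
print(strip_accents('#yogenfrüz'))  # #yogenfruz
```

Note that this changes the hashtag text rather than matching it as written, and letters with no ASCII decomposition (for example ø) are dropped entirely.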

Upvotes: 6

Ibrahim Najjar

Reputation: 19423

Try the following:

hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)

Regex101 Demo

EDIT: Check the useful comment below from Martijn Pieters.
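For reference, a short sketch of how this behaves in Python 3, where \w is already Unicode-aware for str patterns, so re.UNICODE is redundant and re.ASCII restores the ASCII-only behaviour. The sample string is my own.

```python
import re

# Hypothetical sample input with two accented hashtags.
str1 = '#yogenfrüz #niño'

# Python 3: \w matches Unicode letters and digits by default for str patterns.
print(re.findall(r'#(\w+)', str1))            # ['yogenfrüz', 'niño']

# re.ASCII limits \w to [A-Za-z0-9_], reproducing the truncation.
print(re.findall(r'#(\w+)', str1, re.ASCII))  # ['yogenfr', 'ni']
```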

Upvotes: 36
