Dave

Reputation: 390

python regex replace unicode

In the first test string, I'm trying to replace the Unicode right-pointing double arrow character (») in the middle of the text with a space, but it doesn't seem to work.

More generally, I'm trying to remove any standalone run of one or more Unicode "non-word" characters, while keeping words that are a mixture of a-z0-9 and Unicode characters, or that consist only of \w characters.

# -*- coding: utf-8 -*-
import re
str = 'hi… » Test'
str = 're of… » Pr'
str = 're of… » Pr | removepipeaswell'
print str
str = re.sub(r' [^a-z0-9]+ ', ' ', str , re.UNICODE|re.MULTILINE)
# str = re.sub(r' [^\p{Alpha}] ', ' ', str, re.UNICODE)
print str
're of… Pr removepipeaswell' #expected output

str_nbsp = 'afds » asf'

Edit: added another test string. I don't want to remove the "of…" (Unicode ellipsis); I only want to remove standalone runs of Unicode (non-word) characters.

Edit: the following works for the test case, but not on the full HTML. It appears to replace only the matches in the first half of the string and then ignores the rest.

str = re.sub(r' [^a-z0-9]+ ', ' ', str , re.UNICODE|re.MULTILINE)

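Edit: fml, it had to be something stupid like not reading the argument list properly. The fourth positional argument of re.sub is count, not flags, so re.UNICODE|re.MULTILINE (integer value 40) was being used as a cap on the number of replacements: http://bytes.com/topic/python/answers/689341-sub-does-not-replace-all-occurences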

[whoever just deleted their response - thank you for your help.]

str = re.sub(r' [^a-z0-9]+ ', ' ', str)
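A minimal sketch of why dropping the fourth argument helps, assuming the documented signature re.sub(pattern, repl, string, count=0, flags=0). The sample text and the variable names here are made up for illustration; the cap of 2 stands in for the much larger value the flags evaluate to, so the cutoff is visible on a short string.

```python
import re

pattern = r' [^a-z0-9]+ '
text = 'a | b | c | d'  # hypothetical sample with three matches

# Flags are plain integers, so as a fourth positional argument they are
# read as count: re.UNICODE (32) | re.MULTILINE (8) == 40.
as_count = int(re.UNICODE | re.MULTILINE)

# Same effect as a positional cap, shown with a small count:
# only the first two matches are replaced.
capped = re.sub(pattern, ' ', text, count=2)

# Passing flags by keyword leaves count at 0, meaning "replace all".
full = re.sub(pattern, ' ', text, flags=re.UNICODE)

print(as_count)
print(capped)
print(full)
```

If flags are actually needed, passing them as flags=... keeps them out of the count slot.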

The final test string, str_nbsp, did not match the regex above because one of its space characters is actually a non-breaking space. I found this by pasting the string into www.regexr.com and hovering over each character.
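One possible way to handle the non-breaking space, sketched for Python 3, where \s matches Unicode whitespace (including U+00A0) in str patterns. This is not the pattern from the post: swapping the literal spaces for \s is my assumption about what would be wanted, and so is the position of the non-breaking space, since the post does not say which of the two spaces it was.

```python
import re

# Assumption: the non-breaking space (U+00A0) is the one before the arrow.
# Escapes are used so the invisible character is explicit: \u00bb is ».
str_nbsp = 'afds\u00a0\u00bb asf'

# The original pattern needs a literal space on both sides, so it finds
# nothing here and the string comes back unchanged.
literal_space = re.sub(r' [^a-z0-9]+ ', ' ', str_nbsp)

# \s accepts any whitespace, including the non-breaking space. Whitespace
# is excluded from the middle class so it only consumes the symbols.
any_space = re.sub(r'\s[^a-z0-9\s]+\s', ' ', str_nbsp)

print(repr(literal_space))
print(repr(any_space))
```

On Python 2 the same idea would need unicode strings and the re.UNICODE flag passed as flags=re.UNICODE.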

Upvotes: 2

Views: 3730

Answers (1)

Dave

Reputation: 390

str = re.sub(r' [^a-z0-9]+ ', ' ', str)

Upvotes: 3
