Reputation: 390
In the first test string, I'm trying to replace the Unicode right-arrows character (») in the middle of the text with a space, but it doesn't seem to be working.
More generally, I'm trying to remove every standalone run of one or more Unicode "non-word" characters, while keeping words that are a mixture of a-z0-9 and Unicode characters, or that consist only of \w characters.
# -*- coding: utf-8 -*-
import re
str = 'hi… » Test'
str = 're of… » Pr'
str = 're of… » Pr | removepipeaswell'
print str
str = re.sub(r' [^a-z0-9]+ ', ' ', str , re.UNICODE|re.MULTILINE)
# str = re.sub(r' [^\p{Alpha}] ', ' ', str, re.UNICODE)
print str
're of… Pr removepipeaswell' #expected output
str_nbsp = 'afds » asf'
edit: I added another test string. I don't want to remove the "of…" (Unicode ellipsis attached to a word); I only want to remove standalone runs of Unicode non-word characters.
edit: using this works for the test case, but not on the full HTML: it appears to replace matches only in the first part of the string and then ignores the rest.
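One way to read the requirement above is "drop any whitespace-separated token that contains no word character at all". The sketch below takes that reading; it is a different technique from the regex substitution in the question, and `drop_symbol_tokens` is a hypothetical helper name, not something from the original code. It also collapses runs of whitespace into single spaces, which may or may not be acceptable for real HTML.

```python
# -*- coding: utf-8 -*-
import re

# Matches any single word character (letters, digits, underscore).
WORD_CHAR = re.compile(r"\w", re.UNICODE)


def drop_symbol_tokens(text):
    """Hypothetical helper: keep only tokens with at least one word character."""
    kept = [tok for tok in text.split() if WORD_CHAR.search(tok)]
    return " ".join(kept)


result = drop_symbol_tokens(u"re of… » Pr | removepipeaswell")
print(result)  # re of… Pr removepipeaswell
```

"of…" survives because it contains the letters "of", while "»" and "|" are dropped because they contain no word characters.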
str = re.sub(r' [^a-z0-9]+ ', ' ', str , re.UNICODE|re.MULTILINE)
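edit: fml, it had to be something stupid like not reading the argument list properly: the fourth positional argument of re.sub is count, not flags, so the flags value was being used as a cap on the number of replacements. See http://bytes.com/topic/python/answers/689341-sub-does-not-replace-all-occurences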
[whoever just deleted their response - thank you for your help.]
str = re.sub(r' [^a-z0-9]+ ', ' ', str)
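To illustrate why the earlier call stopped partway through a long string, here is a minimal sketch. The sample text and variable names are made up for the demonstration, and it is written so that it also runs on Python 3. The flags `re.UNICODE | re.MULTILINE` are just the integer 40, so passing them in the fourth positional slot limits `re.sub` to 40 replacements.

```python
# -*- coding: utf-8 -*-
import re

PATTERN = r" [^a-z0-9]+ "

# The combined flags are an ordinary integer: 32 | 8 == 40.
flags_as_int = int(re.UNICODE | re.MULTILINE)

# Made-up sample: 50 " » " separators between single letters.
text = u"x" + u" » x" * 50

# Same effect as passing the flags as the 4th positional argument:
# they are interpreted as `count`, so only 40 matches are replaced.
capped = re.sub(PATTERN, u" ", text, count=flags_as_int)

# Passing the flags by keyword removes the accidental cap.
full = re.sub(PATTERN, u" ", text, flags=re.UNICODE | re.MULTILINE)

print(capped.count(u"»"))  # 10 separators left untouched
print(full.count(u"»"))    # 0
```

Naming the argument (`flags=...`) is the safer habit here, since a positional mistake fails silently rather than raising an error.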
The final test string, str_nbsp, did not match the regex above because one of its space characters is actually a non-breaking space. I found this by pasting the string into www.regexr.com and hovering over each character.
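A possible way to cope with the non-breaking space is to match any whitespace with `\s` instead of a literal space. The sketch below assumes Python 3 (where `\s` on text strings already covers U+00A0; on Python 2 it would need unicode strings plus `re.UNICODE`), and it assumes the non-breaking space sits just before the arrow, which the original post does not specify.

```python
# -*- coding: utf-8 -*-
import re

# Assumption: the NBSP (U+00A0) is the character right before the arrow.
str_nbsp = u"afds\u00a0» asf"

# A literal space in the pattern does not match U+00A0, so nothing changes.
literal = re.sub(r" [^a-z0-9]+ ", u" ", str_nbsp)

# \s matches ordinary spaces and NBSP alike; excluding \s inside the class
# keeps the symbol run from swallowing the surrounding whitespace.
flexible = re.sub(r"\s[^a-z0-9\s]+\s", u" ", str_nbsp, flags=re.UNICODE)

print(literal == str_nbsp)  # True
print(flexible)             # afds asf
```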
Upvotes: 2
Views: 3730