Reputation: 13
I'm using a regex to strip "bullet points" from text. These bullet points are often symbols found in Unicode ranges such as Geometric Shapes (\u25a0-\u25ff) or similar. Below is an example of such bullets:
◉ This is a bullet ♦︎ This is also a bullet ☉ And so is this This is not a bullet.
I'm using the following regular expression to match these bullet points:
\s*([\u00a4\u00b7]|[\u2010-\u2017]|[\u2020-\u206f]|[\u2300-\u23f3]|[\u25a0-\u25ff]|[\u2600-\u26ff]|[\u2700-\u27bf]|[\u2b00-\u2bff])\s*
This works in Ruby (see an example at http://rubular.com/r/O7ZObURmlt), but in Python it matches the first character of any string. For example, the T character in the string "This is not a bullet" is matched. You can copy the above regex and example text to http://www.pythonregex.com/ to see this for yourselves.
The regex is compiled with the UNICODE flag.
How can I make Python's regex engine play nice with this expression?
Upvotes: 1
Views: 1724
Reputation: 14778
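Make the string that generates your expression a unicode string, so that the \uXXXX sequences are interpreted as Unicode characters instead of the plain characters u, 2, 0, and so on. (In a byte string, a span like 0-\u inside the brackets then reads as the range from 0 to u, which includes T.) Try the following: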
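import re
# Every piece needs the u prefix; without it, that piece's \u escapes stay literal in Python 2.
regex = re.compile(u"\\s*([\u00a4\u00b7]|[\u2010-\u2017]|"
                   u"[\u2020-\u206f]|[\u2300-\u23f3]|[\u25a0-\u25ff]|"
                   u"[\u2600-\u26ff]|[\u2700-\u27bf]|[\u2b00-\u2bff])\\s*", re.UNICODE)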
You're most probably not using Python 3, in which all strings are Unicode, AFAIK.
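For comparison, here is a minimal sketch of the same pattern on Python 3, assuming version 3.3 or later, where the re module expands \uXXXX escapes itself and a raw string therefore works. The name BULLET_RE and the helper strip_bullets are hypothetical, introduced only for illustration, and replacing each bullet with a single space is an assumption about the desired output.

```python
import re

# Raw string literal: on Python 3.3+ the re module itself interprets \uXXXX,
# so no u prefix is needed and \s needs no doubling.
BULLET_RE = re.compile(
    r"\s*([\u00a4\u00b7]|[\u2010-\u2017]|[\u2020-\u206f]|[\u2300-\u23f3]|"
    r"[\u25a0-\u25ff]|[\u2600-\u26ff]|[\u2700-\u27bf]|[\u2b00-\u2bff])\s*"
)


def strip_bullets(text):
    """Hypothetical helper: replace each bullet and its surrounding
    whitespace with one space, then trim the ends."""
    return BULLET_RE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "\u25c9 This is a bullet \u2609 And so is this"
    print(strip_bullets(sample))
    # Plain text no longer matches at its first character.
    print(BULLET_RE.match("This is not a bullet"))
```

With the escapes interpreted as code points, the character classes contain only the intended symbol ranges, so ordinary letters such as T are left alone.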
Upvotes: 1