Hei

Reputation: 1894

Need to Escape the Character After Special Characters in Python's regex?

I have the following python code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'
regex0 = re.compile('(.+?)\v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
regex1 = re.compile('(.+?)v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
regex2 = re.compile('(.+?) class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)

m0 = regex0.match(line)
m1 = regex1.match(line)
m2 = regex2.match(line)

if m0:
    print 'regex0 is good'
else:
    print 'regex0 is no good'

if m1:
    print 'regex1 is good'
else:
    print 'regex1 is no good'

if m2:
    print 'regex2 is good'
else:
    print 'regex2 is no good'

The output is

regex0 is good
regex1 is no good
regex2 is good

I don't quite understand why I need to escape the character 'v' after "(.+?)" in regex0. If I don't escape it, as in regex1, the match fails. However, for the space right after "(.+?)" in regex2, I don't have to escape anything.

Any idea?

Thanks in advance.

Upvotes: 1

Views: 799

Answers (2)

jsbueno

Reputation: 110746

So, there are some issues with your approach. The ones that contribute to your specific complaint are:

  • You do not mark the regexp string as raw (r' prefix), so the Python compiler replaces some "\"-prefixed sequences inside the string before they even reach the re.compile call.
  • "\v" happens to be one such sequence: it is replaced by a vertical tab, the single character "\x0b".
  • You use the "re.VERBOSE" flag, which tells the regexp engine to ignore any whitespace character in the pattern (unless it is escaped or inside a character class). "\v", being a vertical tab, is one character in this class and is ignored, and so is the space that follows it.

So, there is your match for regex0: the letter "v" is never seen as such. In regex1 the "v" is kept but the space after it is dropped, so the pattern looks for "vclass", which does not occur in the line.
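The effect described above can be checked directly. This is a minimal sketch, assuming Python 3 (hence print as a function); the name `equivalent` is introduced here only for illustration and is not part of the original code.

```python
import re

line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'
flags = re.VERBOSE | re.UNICODE

# Without the r prefix, Python turns \v into a vertical tab before re sees it.
print('\v' == '\x0b')            # True
print(len('\v'), len(r'\v'))     # 1 2

# regex0: the vertical tab and the space are both whitespace, so VERBOSE drops them.
regex0 = re.compile('(.+?)\v class="fieldLabel">name.+?', flags)
# The same pattern with the ignored whitespace removed by hand (illustrative name).
equivalent = re.compile('(.+?)class="fieldLabel">name.+?', re.UNICODE)
# regex1: the literal "v" stays, but the space after it is dropped.
regex1 = re.compile('(.+?)v class="fieldLabel">name.+?', flags)

print(bool(regex0.match(line)))      # True
print(bool(equivalent.match(line)))  # True
print(bool(regex1.match(line)))      # False
```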

Now, for the possible fixes to your approach, in the order that you should try them:

1) Don't use regular expressions to parse HTML. Really. There are a lot of packages that do a good job of parsing HTML, and failing those you can use the stdlib's own HTMLParser (html.parser in Python 3);

2) If possible, use Python 3 instead of Python 2. You will be bitten by the first non-ASCII character inside your HTML body if you go on with the naive approach of treating Python 2 strings as "real life" text. Python 3's automatic encoding handling (and the explicit settings available to you when it is not automatic) avoids that.
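As a rough illustration of the first suggestion, here is a sketch using html.parser from the Python 3 standard library. The class name `FieldLabelParser`, the well-formed sample markup, and the "fieldValue" element are assumptions made for the example; the question only shows a fragment of the real HTML.

```python
from html.parser import HTMLParser


class FieldLabelParser(HTMLParser):
    """Collect the text inside <div class="fieldLabel"> elements."""

    def __init__(self):
        super().__init__()
        self.labels = []
        self._depth = 0  # <div> nesting depth inside a fieldLabel div; 0 means outside

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        classes = (dict(attrs).get('class') or '').split()
        if self._depth:
            self._depth += 1          # nested div inside a label
        elif 'fieldLabel' in classes:
            self._depth = 1           # entering a label
            self.labels.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.labels[-1] += data


# Hypothetical, well-formed sample input.
html = ('<div class="fieldRow jr_name">'
        '<div class="fieldLabel">name</div>'
        '<div class="fieldValue">Ann</div>'
        '</div>')

parser = FieldLabelParser()
parser.feed(html)
parser.close()
print(parser.labels)  # ['name']
```

Because the parser tracks tags rather than character positions, it does not care about line breaks or extra whitespace between attributes.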

Since you are probably not changing anyway, try to use regex.findall instead of regex.match. It returns a list of matching strings and could retrieve all the attributes you are looking for at once, without having to match from the beginning of the file or depending on line breaks inside the HTML.
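A small sketch of the findall suggestion, assuming Python 3. The sample markup (including the "jr_city" row) and the name `label_re` are hypothetical; note the raw string and the absence of re.VERBOSE, so whitespace in the pattern is only matched where `\s+` asks for it.

```python
import re

# Hypothetical input spanning several lines.
html = '''<div class="fieldRow jr_name">
<div class="fieldLabel">name</div>
</div>
<div class="fieldRow jr_city"><div class="fieldLabel">city</div></div>'''

# With one capturing group, findall returns a list of that group's text.
label_re = re.compile(r'<div\s+class="fieldLabel">([^<]*)<')
print(label_re.findall(html))  # ['name', 'city']
```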

Upvotes: 3

tigrank

Reputation: 31

\v is a special escape sequence (a vertical tab), which you can read about here: https://docs.python.org/2/library/re.html

Python regexes are usually written as raw strings, r'your regex', where "r" means raw string. (https://docs.python.org/3/reference/lexical_analysis.html)

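In a regex, a backslash gives some characters a special meaning. E.g. if you write \s, it matches whitespace, while a plain s is just the letter "s". To make sure the backslashes reach the regex engine unchanged, use raw strings. Note that re.VERBOSE ignores unescaped whitespace in the pattern, so the space after "v" must be escaped as well (or the flag dropped). The line below is the solution you need, I believe.

regex1 = re.compile(r'(.+?)v\ class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)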
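A quick check, assuming Python 3, of how the raw-string pattern behaves depending on re.VERBOSE and on whether the space after "v" is escaped. The variable names are made up for this comparison.

```python
import re

line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'

# Raw string, VERBOSE on, plain space: the space is ignored, so "vclass" is searched for.
with_verbose = re.compile(r'(.+?)v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
# Raw string, VERBOSE on, escaped space: the space is kept.
escaped_space = re.compile(r'(.+?)v\ class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
# Raw string, VERBOSE off: the space is matched literally.
no_verbose = re.compile(r'(.+?)v class="fieldLabel">name.+?', re.UNICODE)

print(bool(with_verbose.match(line)))   # False
print(bool(escaped_space.match(line)))  # True
print(bool(no_verbose.match(line)))     # True
```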

Upvotes: 0
