Sindre Smistad

Reputation: 139

Python regex problem

I'm trying to match some HTML with a regex, and the regex works fine when used like this:

import re

reg = r";!--\"\'<[a-i0-9]{8}>=&\{\(\)\}"

html_data = "some html data"

if re.search(reg, html_data):
    print("Match")

But if it gets the HTML data either by reading a local file or by fetching it from the web, it fails. I've downloaded the HTML page from the server, then copy-pasted the source into the script, and it works fine. But reading directly from the file or the server does not work.

I've also checked the local file with a hex editor to verify that there isn't some special char that is screwing me over.

Example of string to be matched:

<input type="text" value=";!--\"\'<a41cgb04>=&{()}" name="url" maxlength="200" class="url" style="width:495px;"> 

Where ;!--\"\'<a41cgb04>=&{()} is the part that should be matched.

Upvotes: 0

Views: 301

Answers (4)

eyquem

Reputation: 27575

I think your problem comes from misreading this:

<input type="text" value=";!--\"\'<a41cgb04>=&{()}" name="url" maxlength="200" class="url" style="width:495px;">

You assume that the backslashes in front of " and ' are in the source code. I suspect that at least one of the two is an artifact of how the text was displayed and is not actually present in the HTML.

I don't know how you obtained the sequence of characters above.
But I think the phenomenon is the same as the one seen when using repr():
the display contains backslashes that show how the string would be written as a literal, but not all of those backslashes are part of the string's value.

You'll better understand what I mean with this:

a = "abc ' def "

b = ' ABC " DEF'

print repr(a + b)

result

'abc \' def  ABC " DEF'

.

Update

Take the following web page as an example:

http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/

.

Viewing the source code of this page in a browser shows the following as the 13th line:

<meta name="abstract" content="Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia" />

Now, executing the following code

from urllib import urlopen


url = 'http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/'

sock = urlopen(url)
srce = sock.read()
sock.close()


li = srce.splitlines(True)

print 'Displayed normally:\n-------------------\n'
print '\n'.join(li[12:14])
print

print 'Displayed with repr():\n----------------------\n'
print '\n'.join(map(repr,li[12:14]))
print

print 'Displayed in a list:\n--------------------\n'
print li[12:14]

produces the result:

Displayed normally:
-------------------

<meta name="abstract" content="Heronswood Bergenia 'Lunar Glow' PP20247 in  Bergenia" />

<meta name="allow-search" content="YES" />


Displayed with repr():
----------------------

'<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in  Bergenia" />\n'
'<meta name="allow-search" content="YES" />\n'

Displayed in a list:
--------------------

['<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in  Bergenia" />\n', '<meta name="allow-search" content="YES" />\n']

Displaying the source code normally has a drawback: special characters such as '\n', '\r' and '\t' are not visible, which makes it hard to write a regex pattern.
That is why it is easier to analyze an HTML source when its strings are displayed without interpretation.

Displaying the source code with repr() or inside a list shows every character explicitly.
The only inconvenience is that a ' character in the middle of a string is sometimes shown escaped, because that is how it must be written when the string literal is delimited by ' quotes. When a list is displayed, its elements are rendered with repr(), which is why print li[12:14] shows the elements in the same form as print '\n'.join(map(repr,li[12:14])). In other words, repr() displays a string the way it would have to be written as a literal to produce that value.
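A small self-contained check of this, as a sketch in Python 3 syntax. The string s is made up for illustration; the point is that the backslash appears in the repr() and list displays but is not part of the string's value.

```python
# Hypothetical sample string containing both kinds of quotes and a newline.
s = 'content="Bergenia \'Lunar Glow\'"\n'

print(s)          # normal display: the newline is rendered, no backslash shown
print(repr(s))    # explicit display: \' and \n appear as escape sequences
print([s])        # a list renders its elements with repr()

# The value itself contains no backslash character at all.
has_backslash = "\\" in s
print(has_backslash)
```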

.

In the end, what I want to stress is this: if someone writes a regex pattern with "\\\\'" or r"\\'" because the repr() display of a source code made them believe there is a \ character before a ' character, the pattern is incorrect.

The code that follows illustrates this better, I hope:

import re
from urllib import urlopen


url = 'http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/'

sock = urlopen(url)
srce = sock.read()
sock.close()


pat = '<meta name="abstract" content="(Heronswood Bergenia (\'Lunar Glow\')? [a-zA-Z]+\d+ .*?)" />'
regx = re.compile(pat)
print regx.search(srce).groups()

pat = "<meta name=\"abstract\" content=\"(Heronswood Bergenia (\\\\'Lunar Glow\\\\')? [a-zA-Z]+\d+ .*?)\" />"
regx = re.compile(pat)
print regx.search(srce).groups()

result

("Heronswood Bergenia 'Lunar Glow' PP20247 in  Bergenia", "'Lunar Glow'")

Traceback (most recent call last):
  File "I:\trez.py", line 18, in <module>
    print regx.search(srce).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
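
The same comparison can be reproduced without a network connection. This is a minimal sketch in Python 3 syntax; the srce string below is a hand-typed stand-in for the downloaded page, not the actual response, and the names good and bad are only illustrative.

```python
import re

# Stand-in for the page source: the apostrophes are NOT preceded by backslashes.
srce = '<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in  Bergenia" />\n'

# Pattern written for the real characters: a plain ' with no backslash.
good = re.compile(r"""content="(Heronswood Bergenia ('Lunar Glow')? [a-zA-Z]+\d+ .*?)" />""")

# Pattern written from the repr() display: it demands a literal \ before each '.
bad = re.compile(r"""content="(Heronswood Bergenia (\\'Lunar Glow\\')? [a-zA-Z]+\d+ .*?)" />""")

good_match = good.search(srce)
bad_match = bad.search(srce)

print(good_match.groups())
print(bad_match)  # search() found nothing, hence the AttributeError on .groups()
```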

Upvotes: 2

WombatPM

Reputation: 2609

I'd change your regex since you are in backslash hell. This expression works using a file.

reg = ";!--....<[a-i0-9]{8}>=&\{\(\)\}"

In breaking your expression down into parts:

reg = ";!--"  matches.
reg = ";!--\\" raises an error about a bogus escape at the end of the line.

The string ";!--\\" holds a single trailing backslash, and the regex engine does not accept a pattern that ends in an unfinished escape.
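
One way to see where that error comes from, as a minimal Python 3 sketch (the variable names are illustrative): the string literal is valid Python, and it is the re module that raises when asked to compile it.

```python
import re

pattern = ";!--\\"   # valid string literal: 5 characters, the last one is a backslash

try:
    re.compile(pattern)
    compiled_ok = True
except re.error as exc:
    # The regex engine sees a backslash with nothing after it to escape.
    compiled_ok = False
    print("re.error:", exc)

# Doubling the backslash at the regex level (here via a raw string) compiles,
# and the pattern then matches a literal backslash in the text.
fixed = re.compile(r";!--\\")
found = fixed.search(r'value=";!--\"') is not None
print(found)
```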

As the saying goes:
A developer has a problem and thinks "I'll solve it with regular expressions".

Now the developer has two problems.

Upvotes: 0

Jason Orendorff

Reputation: 45086

The backslash character, \, has a special meaning in regular expressions. If you want to match a backslash in the text, you have to write \\ in the regular expression:

reg = r";!--\\\"\\'<[a-i0-9]{8}>=&\{\(\)\}"
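
A minimal sketch (Python 3) of why the pasted string and the file can behave differently, assuming the file really does contain the backslashes: in a normal string literal Python turns \" and \' into bare quotes, whereas text read from a file keeps the backslashes. The temporary file is only a stand-in for the downloaded page, and all variable names are illustrative.

```python
import os
import re
import tempfile

original = r";!--\"\'<[a-i0-9]{8}>=&\{\(\)\}"     # pattern from the question
escaped = r";!--\\\"\\'<[a-i0-9]{8}>=&\{\(\)\}"    # each backslash in the text matched by \\

# Pasted into a normal literal: Python consumes the backslashes before the quotes.
pasted = "value=\";!--\"\'<a41cgb04>=&{()}\""

# Written to a file and read back: the backslashes are ordinary characters.
raw_text = r'value=";!--\"\'<a41cgb04>=&{()}"'
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as handle:
    handle.write(raw_text)
    path = handle.name
with open(path) as handle:
    from_file = handle.read()
os.remove(path)

original_on_pasted = re.search(original, pasted) is not None
original_on_file = re.search(original, from_file) is not None
escaped_on_file = re.search(escaped, from_file) is not None

print(original_on_pasted, original_on_file, escaped_on_file)
```

Printing repr(html_data) just before the search is a quick way to check which of the two situations applies.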

Upvotes: 0

Boogle

Reputation: 33

Perhaps this http://docs.python.org/library/htmlparser.html will be more useful to you than trying to use a regex. I tend to agree with Mark Pilgrim that using regex gives you two problems, regex and your original issue.
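
As a rough sketch of that approach in Python 3, where the module is named html.parser: the ValueCollector class and the sample tag are made up for illustration, and the tag uses a well-formed attribute value rather than the string from the question.

```python
from html.parser import HTMLParser


class ValueCollector(HTMLParser):
    """Hypothetical helper: collects the value attribute of every <input> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "input":
            attr_map = dict(attrs)
            if "value" in attr_map:
                self.values.append(attr_map["value"])


collector = ValueCollector()
collector.feed('<input type="text" value="a41cgb04" name="url" maxlength="200">')
collector.close()
print(collector.values)
```

A regex such as [a-i0-9]{8} could then be applied to each collected value instead of to the whole page.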

Upvotes: 0
