Reputation: 139
I'm trying to match some HTML with a regex. The regex works fine when the data is defined inline, like this:
import re
reg = r";!--\"\'<[a-i0-9]{8}>=&\{\(\)\}"
html_data = "some html data"
if re.search(reg, html_data):
    print("Match")
But if the HTML data comes from reading a local file or fetching it from the web, the match fails. I downloaded the HTML page from the server and copy-pasted the source into the script, and that works fine. Reading directly from the file or the server does not.
I've also checked the local file with a hex editor to verify that there isn't some special char that is screwing me over.
Example of string to be matched:
<input type="text" value=";!--\"\'<a41cgb04>=&{()}" name="url" maxlength="200" class="url" style="width:495px;">
The part that should be matched is ;!--\"\'<a41cgb04>=&{()}
Upvotes: 0
Views: 301
Reputation: 27575
I think your problem comes from misreading this:
<input type="text" value=";!--\"\'<a41cgb04>=&{()}" name="url" maxlength="200" class="url" style="width:495px;">
You think that the backslashes in front of " and ' are in the source code. I think one of the two is in fact an artefact of how the text was displayed and is not actually present in the HTML.
I don't know how you obtained the sequence of characters above,
but I suspect the phenomenon is the same as the one seen with repr():
the display contains backslashes that help you read the string, yet not all of them are part of the string's value.
This should make it clearer:
a = "abc ' def "
b = ' ABC " DEF'
print repr(a + b)
result
'abc \' def  ABC " DEF'
Take the following web page as an example:
http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/
Viewing the source of this page in a browser shows that its 13th line is
<meta name="abstract" content="Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia" />
Now, executing the following code
from urllib import urlopen
url = 'http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/'
sock = urlopen(url)
srce = sock.read()
sock.close()
li = srce.splitlines(True)
print 'Displayed normally:\n-------------------\n'
print '\n'.join(li[12:14])
print
print 'Displayed with the help of repr():\n----------------------\n'
print '\n'.join(map(repr,li[12:14]))
print
print 'Displayed in a list:\n--------------------\n'
print li[12:14]
produces the result:
Displayed normally:
-------------------
<meta name="abstract" content="Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia" />
<meta name="allow-search" content="YES" />
Displayed with the help of repr():
----------------------
'<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in Bergenia" />\n'
'<meta name="allow-search" content="YES" />\n'
Displayed in a list:
--------------------
['<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in Bergenia" />\n', '<meta name="allow-search" content="YES" />\n']
Displaying the source code normally has a drawback: special characters such as '\n', '\r' and '\t' are not visible, which makes it hard to write a regex pattern.
That is why analyzing an HTML source is easier when the strings are displayed without interpretation.
Displaying the source with repr(), or inside a list, shows every character explicitly.
The only inconvenience is that ' characters in the middle of a string are sometimes shown escaped, because that is how they must be written when the string is delimited by ' quotes. When a list is displayed, its elements are rendered with repr(); that is why the instruction print li[12:14] displays the elements in the same form as the instruction print '\n'.join(map(repr,li[12:14])). In other words, repr() displays a string the way it would have to be written in source code to produce that value.
In the end, what I want to underline is this:
if someone writes a regex pattern with "\\\\'" or r"\\'" because a repr() display of the source led them to believe there is a \ before the ' character, the pattern is wrong.
The following code shows this more clearly, I hope:
import re
from urllib import urlopen
url = 'http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/'
sock = urlopen(url)
srce = sock.read()
sock.close()
pat = '<meta name="abstract" content="(Heronswood Bergenia (\'Lunar Glow\')? [a-zA-Z]+\d+ .*?)" />'
regx = re.compile(pat)
print regx.search(srce).groups()
pat = "<meta name=\"abstract\" content=\"(Heronswood Bergenia (\\\\'Lunar Glow\\\\')? [a-zA-Z]+\d+ .*?)\" />"
regx = re.compile(pat)
print regx.search(srce).groups()
result
("Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia", "'Lunar Glow'")
Traceback (most recent call last):
  File "I:\trez.py", line 18, in <module>
    print regx.search(srce).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
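If you want to check this without fetching a page, here is a minimal sketch in Python 3. The sample line is made up to resemble the one above, and the variable names are only for illustration. It counts the backslashes that are really in the string, compares that with what repr() shows, and then tries both kinds of pattern.

```python
import re

# Hypothetical sample line, standing in for text read from a file or socket.
line = "content=\"Heronswood Bergenia 'Lunar Glow' PP20247\""

# repr() wraps the string in single quotes, so it shows each ' as \'.
shown = repr(line)

# Counting is more reliable than eyeballing the display.
print(line.count("\\"))    # backslashes really present in the string
print(shown.count("\\"))   # backslashes that appear only in the repr() display

# A pattern without backslashes matches; one that demands them does not.
print(bool(re.search(r"'Lunar Glow'", line)))
print(bool(re.search(r"\\'Lunar Glow\\'", line)))
```

Running the same count on the data you read from your file or from the server should tell you whether your pattern needs the backslashes at all.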
Upvotes: 2
Reputation: 2609
I'd change your regex, since you are in backslash hell. This expression, which replaces the \"\' part with four . wildcards, works when reading from a file:
reg = ";!--....<[a-i0-9]{8}>=&\{\(\)\}"
Breaking your expression down into parts:
reg = ";!--" matches.
reg = ";!--\\" throws an error about a bogus escape at end of line.
The string itself is legal Python; it is the re module that rejects a pattern ending in a single backslash, because that backslash has nothing left to escape.
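This can be checked directly. A minimal sketch in Python 3, with variable names chosen only for illustration:

```python
import re

pattern = ";!--\\"          # valid Python literal holding the 5 characters ;!--\
print(len(pattern))

try:
    re.compile(pattern)     # the regex engine sees a dangling escape
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)

# Doubling the backslash at the regex level gives a pattern that
# matches one literal backslash.
ok = re.compile(r";!--\\")
print(bool(ok.search(";!--\\")))
```

The exact wording of the error message differs between Python versions, so the sketch only prints whatever the re module reports.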
As the saying goes:
A developer has a problem and thinks
"I'll solve it with regular expressions".
Now the developer has two problems.
Upvotes: 0
Reputation: 45086
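The backslash character, \, has a special meaning in regular expressions. If you want to match a backslash in the text, you have to write \\ in the regular expression. Inside a raw string delimited by ", the quote also needs a backslash in front of it so that it does not end the string:
reg = r";!--\\\"\\'<[a-i0-9]{8}>=&\{\(\)\}"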
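As a check, and assuming the data really does contain the backslashes, here is a minimal sketch in Python 3. The sample line is copied from the question. The triple-quoted raw string is just my way of keeping it verbatim, and the quote inside the pattern is written \" so that the raw string literal parses.

```python
import re

# Sample line from the question, kept verbatim in a triple-quoted raw string
# so the backslashes stay in the data.
html_data = r'''<input type="text" value=";!--\"\'<a41cgb04>=&{()}" name="url" maxlength="200" class="url" style="width:495px;">'''

# \\ matches one literal backslash; \" keeps the quote from ending the raw string.
reg = r";!--\\\"\\'<[a-i0-9]{8}>=&\{\(\)\}"

m = re.search(reg, html_data)
print(m.group(0) if m else "no match")

# The pattern from the question has no \\ in it, so it expects the quotes
# without backslashes and does not match this data.
original = r";!--\"\'<[a-i0-9]{8}>=&\{\(\)\}"
print(re.search(original, html_data) is None)
```

If the backslashes turn out not to be in the data, the pattern from the question is the one that should match instead.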
Upvotes: 0
Reputation: 33
Perhaps this http://docs.python.org/library/htmlparser.html will be more useful to you than trying to use a regex. I tend to agree with Mark Pilgrim that using regex gives you two problems, regex and your original issue.
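A minimal sketch of that approach, written for Python 3, where the module lives at html.parser (in Python 2 it was HTMLParser). The class name ValueCollector and the sample markup are my own assumptions. In particular, the sample assumes the page entity-encodes the quotes and angle brackets inside the attribute, which may not be what your server sends.

```python
import re
from html.parser import HTMLParser


class ValueCollector(HTMLParser):
    """Collects the value attribute of every <input> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; entities are already decoded.
        if tag == "input":
            for name, value in attrs:
                if name == "value" and value is not None:
                    self.values.append(value)


# Assumed sample: the special characters are entity-encoded in the markup.
html_data = ('<input type="text" '
             'value=";!--&quot;\'&lt;a41cgb04&gt;=&amp;{()}" name="url">')

parser = ValueCollector()
parser.feed(html_data)
parser.close()

# The regex now runs on the decoded attribute value, not on raw markup.
reg = re.compile(r""";!--"'<[a-i0-9]{8}>=&\{\(\)\}""")
hits = [v for v in parser.values if reg.search(v)]
print(hits)
```

The parser takes care of the attribute quoting and entity decoding, so the regex only has to describe the value itself.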
Upvotes: 0