Reputation: 387
I'm trying to extract a token from a string and return it. The same approach works fine on other strings, but for this one neither findall nor search returns a result.
pattern = re.compile(r'<input class="token" value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(htmlstring)
for match in matches:
    print match
There is only one value in each response string, yet "print match" outputs nothing.
I also tried re.search, but the same thing happens: it returns None.
MORE INFO:
This is part of the HTML I'm parsing:
<form id="threadReplyForm" class="clearfix" method="post" action="/go/messages/private/threadID=0551796">
<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">
<input type="hidden" value="reply" name="action">
<div class="editorWrapper">
<div id="premiumSmiliesNotAllowed" class="warning" style="display: none;">
<div id="editor_13" class="clearfix editor" mode="full">
<ul id="editorToolbar_13" class="editorToolbar clearfix">
<textarea id="messageInput" class="autogrow" cols="20" rows="8" name="message"></textarea>
<div id="previewDiv" class="previewArea" style="display: none;"></div>
</div>
<script>
</div>
<script>
<span class="loadingIndicator right loadingIndicatorMessage">
<p class="clearfix">
</form>
I'm parsing it with this:
pattern = re.compile(r'<input class="csrftoken" type="hidden" value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(str(response.read()))
for match in matches:
    print match
I'm trying to get a7b161b7 as output.
Upvotes: 0
Views: 1874
Reputation:
I'm not a Python person, and I don't recommend using regex to parse HTML, but it might be
possible to match unordered attribute-value data this way. Just put in the pairs that are
needed to qualify the tag. It doesn't have to be all of them, and they can be in any order.
Modifiers: expanded, single-line string, global.
The value capture group is $5
Edit
Changed (?= (?:".*?"|\'.*?\'|[^>]*?)+
to (?= (?:[^>"\']|(?>".*?"|\'.*?\'))*?
because a lazy quantifier in the original form is forced to overrun markup boundaries to satisfy the lookahead. The new sub-expression handles embedded markup such as attr="so< m >e" without overruns.
<input
(?=\s)
(?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) class \s*=\s* ([\'"]) \s* csrftoken \s*\1 )
(?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) name \s*=\s* ([\'"]) \s* csrftoken_reply \s*\2 )
(?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) type \s*=\s* ([\'"]) \s* hidden \s*\3 )
(?= (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) value \s*=\s* ([\'"]) \s* (.*?) \s*\4 )
\s+ (?:".*?"|\'.*?\'|[^>]*?)+ (?<!/)
>
All the usual caveats apply: the tag could be hidden in embedded code, could be inside comments, etc.
Extra regex logic is needed for that.
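For Python readers, here is a simplified sketch of the same lookahead idea. It is not a port of the expression above: it drops the atomic groups (Python's re module only supports them from 3.11), checks three attributes instead of four, and assumes no ">" appears inside an attribute value. The name ATTR_PATTERN and the helper extract_token are made up for illustration, and the value ends up in group 4 here rather than $5.

```python
import re

# Each lookahead checks one attribute independently, so the attributes
# may appear in any order. Assumption: no '>' inside attribute values.
ATTR_PATTERN = re.compile(r'''
    <input
    (?= [^>]* \s class \s*=\s* (["']) csrftoken \1 )
    (?= [^>]* \s name  \s*=\s* (["']) csrftoken_reply \2 )
    (?= [^>]* \s value \s*=\s* (["']) (.*?) \3 )
    [^>]* >
''', re.VERBOSE)


def extract_token(html):
    """Return the value attribute of the csrftoken input, or None."""
    found = ATTR_PATTERN.search(html)
    return found.group(4) if found else None


original_order = '<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">'
shuffled_order = "<input name='csrftoken_reply' value='a7b161b7' type='hidden' class='csrftoken'>"

print(extract_token(original_order))
print(extract_token(shuffled_order))
```

Both calls should yield the same token, since the lookaheads do not care about attribute order or quote style.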
Upvotes: 0
Reputation: 1823
Sorry, parsing HTML with regex in 2011 is borderline insanity :) The number of libraries optimized for this task is quite large, the best ones being the above-mentioned BeautifulSoup and lxml. I can understand that you wouldn't want to deal with lxml because of its list of dependencies and messy installation, but BeautifulSoup is one file and would make your code so much more robust.
TL;DR: you're reinventing the wheel.
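To make the "use a parser" point concrete without installing anything, here is a minimal sketch using the standard library's html.parser instead of BeautifulSoup or lxml. It assumes Python 3 (in Python 2 the module is named HTMLParser), and the names TokenFinder and find_token are hypothetical, chosen just for this example.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Remembers the value of the first <input name="csrftoken_reply">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs in any order.
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrftoken_reply":
            self.token = attributes.get("value")


def find_token(html):
    finder = TokenFinder()
    finder.feed(html)
    return finder.token


sample = '''
<form id="threadReplyForm" class="clearfix" method="post">
<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">
<input type="hidden" value="reply" name="action">
</form>
'''

print(find_token(sample))
```

Because the parser hands over attributes as name/value pairs, the lookup does not depend on attribute order, spacing, or the class name.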
Upvotes: 0
Reputation: 16107
You'll have to give an example of the string you are trying to parse, because this works for me.
import re
htmlstring = """
<input class="token" value="foo" name="csrftoken_reply">
"""
pattern = re.compile(r'<input class="token" value="(.+?)" name="csrftoken_reply">')
matches = pattern.findall(htmlstring)
for match in matches:
    print match
Beyond that, have you considered using a library designed for something like this? Regexes can be a bit fragile when it comes to parsing HTML. Beautiful Soup seems to be a popular tool for this job.
Update
You've got a wrong class value, an extra space, and you forgot the 'input type="hidden"'. Here's something closer, though I would still discourage use of regex for this:
r'<input class="csrftoken" type="hidden" value="(.+?)" name="csrftoken_reply">'
This works as well (I'm assuming there's only one 'csrftoken_reply' element):
r'value="(.+?)" name="csrftoken_reply">'
Both of these work for me to get your desired value.
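As a quick check, here is a small self-contained sketch that runs the original pattern and both corrected ones against the two input lines from the question. The variable names are just for illustration, and print is written as a function so the snippet runs on Python 3 as well.

```python
import re

html = '''<input class="csrftoken" type="hidden" value="a7b161b7" name="csrftoken_reply">
<input type="hidden" value="reply" name="action">'''

# Pattern from the question: wrong class value and no type attribute.
original = re.compile(r'<input class="token" value="(.+?)" name="csrftoken_reply">')
# The two corrected patterns.
full_tag = re.compile(r'<input class="csrftoken" type="hidden" value="(.+?)" name="csrftoken_reply">')
short = re.compile(r'value="(.+?)" name="csrftoken_reply">')

print(original.findall(html))  # []
print(full_tag.findall(html))  # ['a7b161b7']
print(short.findall(html))     # ['a7b161b7']
```

One caveat with the short pattern: `.+?` can run across an earlier value="..." attribute if another input precedes the token on the same line, so `[^"]+` may be a safer capture in that case.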
Upvotes: 1