Reputation: 1806
I am working on a regex match function in Python. I have the following code:
import re

def src_match(line, img):
    imgmatch = re.search(r'<img src="(?P<img>.*?)"', line)
    if imgmatch and imgmatch.groupdict()['img'] == img:
        print 'the match was:', imgmatch.groupdict()['img']
The above does not seem to work for me at all. On the other hand, I do have luck with this:
def href_match(line, url):
    hrefmatch = re.search(r'<a href="(?P<url>.*?)"', line)
    if hrefmatch and hrefmatch.groupdict()['url'] == url:
        print 'the match was:', hrefmatch.groupdict()['url']
    else:
        return None
Can someone please explain why this would be (or whether both should work)? For example, is there something special about the identifier in the href_match() function? It can be assumed that in both functions I am passing a line that contains the string I am searching for, along with the string itself.
EDIT: I should mention that I am sure I will never get a tag like:
<img width="200px" src="somefile.jpg">
The reason is that the HTML is generated by a specific program that never yields such a tag. This example should be taken as purely theoretical, under the assumption that I will always get a tag like:
<img src="somefile.jpg">
EDIT:
Here is an example of a line I am feeding to the function that fails to match the input argument:
<p class="p1"><img src="myfile.anotherword.png" alt="beat-divisions.tiff"></p>
Upvotes: 1
Views: 1342
Reputation: 56654
Rule #37: do not attempt to parse HTML with regex.
Use the right tool for the job - in this case, BeautifulSoup.
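As a rough illustration of the parser-based approach, here is a minimal sketch. It uses the standard library's html.parser instead of BeautifulSoup (a substitution made only to keep the example dependency-free), it is written for Python 3, and the names ImgSrcCollector and img_sources are hypothetical helpers, not part of the original code.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src value of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so attribute order is irrelevant
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)


def img_sources(line):
    """Return the src values of all <img> tags found in line."""
    parser = ImgSrcCollector()
    parser.feed(line)
    parser.close()
    return parser.sources
```

With this, a check such as `img in img_sources(line)` would take the place of the regex comparison, and it should not care where src appears inside the tag.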
Edit:
Cutting and pasting the function and testing it as
>>> src_match('this is <img src="my example" />','my example')
the match was: my example
so it appears to function; however, it will fail on (perfectly valid) HTML code like
<img width="200px" src="Y U NO C ME!!" />
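To make that failure concrete, the sketch below (Python 3) runs the question's pattern against this tag next to a loosened pattern. LOOSE is a hypothetical variant written for illustration only; it is still brittle (single-quoted attributes would defeat it, for example), so it is not a substitute for a real parser.

```python
import re

# The pattern from the question: requires src to directly follow "<img "
STRICT = re.compile(r'<img src="(?P<img>.*?)"')

# Hypothetical loosened pattern: tolerates other attributes before src
LOOSE = re.compile(r'<img\b[^>]*?\bsrc="(?P<img>[^"]*)"')

tag = '<img width="200px" src="Y U NO C ME!!" />'

strict_hit = STRICT.search(tag)  # no match: width comes before src
loose_hit = LOOSE.search(tag)    # matches and captures the src value

print('strict:', strict_hit)
print('loose: ', loose_hit.group('img') if loose_hit else None)
```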
Edit4:
>>> src_match('<p class="p1"><img src="myfile.png" alt="beat-divisions.tiff"></p>','myfile.png')
the match was: myfile.png
>>> src_match('<p class="p1"><img src="myfile.anotherword.png" alt="beat-divisions.tiff"</p>\n','myfile.anotherword.png')
the match was: myfile.anotherword.png
Still works; are you sure the value you are trying to match against is correct?
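One way to check is to print both sides of the comparison with repr(), which exposes stray whitespace or a trailing newline that a plain print would hide. This is only a debugging sketch in Python 3; debug_src_match is a hypothetical name, and the trailing-newline case is an assumed possibility, not something confirmed by the question.

```python
import re


def debug_src_match(line, img):
    """Hypothetical debugging variant of src_match: shows both compared values."""
    match = re.search(r'<img src="(?P<img>.*?)"', line)
    found = match.group('img') if match else None
    # repr() makes invisible characters such as '\n' or trailing spaces visible
    print('found:   ', repr(found))
    print('expected:', repr(img))
    return found == img


line = '<p class="p1"><img src="myfile.anotherword.png" alt="beat-divisions.tiff"></p>'
same = debug_src_match(line, 'myfile.anotherword.png')
# Assumed scenario: the expected value was read from a file and kept its newline
different = debug_src_match(line, 'myfile.anotherword.png\n')
```

If the two printed values differ in any way, the equality test in src_match fails silently, since that function has no else branch to report a mismatch.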
Upvotes: 1