Reputation: 1
OK, please be gentle: this is my first Stack Overflow question and I've struggled with this for a few hours. I'm sure the answer is something obvious staring me in the face, but I give up.
I'm trying to grab an element from a page on a name website (i.e. determine the gender of a name).
The Python code I've written is here:
import re
import urllib2
response=urllib2.urlopen("http://www.behindthename.com/name/janet")
html=response.read()
print html
patterns = ['Masculine','Feminine']
for pattern in patterns:
    print "Looking for %s in %s<<<" % (pattern,html)
    if re.findall(pattern,html):
        print "Found a match!"
        exit
    else:
        print "No match!"
When I dump html I see Feminine there, but the re.findall isn't matching. What in the world am I doing wrong?
Upvotes: 0
Views: 82
Reputation: 474171
Do not parse HTML with regex; use a specialized tool, an HTML parser.
Example using BeautifulSoup:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://www.behindthename.com/name/janet'
soup = BeautifulSoup(urlopen(url))
print soup.select('div.nameinfo span.info')[0].text # prints "Feminine"
Or, you can find an element by text:
gender = soup.find(text='Feminine')
And then check whether it is None (not found) or not: gender is None.
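If you want to try the parser idea offline, here is a minimal sketch for Python 3. It is not the BeautifulSoup code above: it swaps in the standard library's html.parser so it runs without installing anything or fetching the page. The markup in SAMPLE_HTML is an assumption loosely modelled on the div.nameinfo / span.info selector, and InfoSpanParser and find_gender are hypothetical names made up for this example; the real page's structure may differ.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modelled on the selector used above;
# the real page's structure may differ.
SAMPLE_HTML = """
<div class="nameinfo">
  <span class="namesub">Gender</span>
  <span class="info">Feminine</span>
</div>
"""


class InfoSpanParser(HTMLParser):
    """Collect the text of every <span class="info"> element."""

    def __init__(self):
        super().__init__()
        self.in_info = False   # True while inside a matching span
        self.values = []       # one text entry per matching span

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "info" in classes:
            self.in_info = True
            self.values.append("")

    def handle_endtag(self, tag):
        # Simplification: any closing span ends the capture
        # (nested spans are not handled in this sketch).
        if tag == "span":
            self.in_info = False

    def handle_data(self, data):
        if self.in_info:
            self.values[-1] += data


def find_gender(html):
    """Return the first non-empty span.info text, or None if absent."""
    parser = InfoSpanParser()
    parser.feed(html)
    parser.close()
    for value in parser.values:
        if value.strip():
            return value.strip()
    return None


gender = find_gender(SAMPLE_HTML)
if gender is None:
    print("No match!")
else:
    print("Found a match: %s" % gender)
```

As with the find-by-text approach, the caller only has to test the result against None to tell "found" from "not found".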
Upvotes: 1