Ted Avey

Reputation: 1

Python re regex matching issue

OK, please be gentle: this is my first Stack Overflow question, and I've struggled with this for a few hours. I'm sure the answer is something obvious staring me in the face, but I give up.

I'm trying to grab an element from a webpage (i.e., determine the gender of a name) on a name website.

The python code I've written is here:

import re
import urllib2

response=urllib2.urlopen("http://www.behindthename.com/name/janet")
html=response.read()
print html

patterns = ['Masculine','Feminine']

for pattern in patterns:
    print "Looking for %s in %s<<<" % (pattern,html)

    if re.findall(pattern,html):
        print "Found a match!"
        break  # a bare "exit" is a no-op; break stops the loop
    else:
        print "No match!"

When I dump html I see Feminine there, but the re.findall isn't matching. What in the world am I doing wrong?

Upvotes: 0

Views: 82

Answers (1)

alecxe

Reputation: 474171

Do not parse HTML with regex. Use a specialized tool instead: an HTML parser.

Example using BeautifulSoup:

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://www.behindthename.com/name/janet'
soup = BeautifulSoup(urlopen(url), 'html.parser')  # name the parser explicitly

print soup.select('div.nameinfo span.info')[0].text  # prints "Feminine"

Or, you can find an element by text:

gender = soup.find(text='Feminine')

Then check whether the result is None (meaning the text was not found): gender is None.
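
For readers without BeautifulSoup installed, the same "parse, then check for None" idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the BeautifulSoup code above, and it is written for Python 3 rather than the Python 2 used in the question. SAMPLE_HTML, TextCollector and find_gender are hypothetical names, and the sample markup is only an assumption about what the page might look like.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modelled on the selector used above;
# the real page's structure may differ.
SAMPLE_HTML = """
<div class="nameinfo">
  <span class="label">Gender</span>
  <span class="info">Feminine</span>
</div>
"""


class TextCollector(HTMLParser):
    """Collect the stripped, non-empty text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # Called once per run of text between tags.
        text = data.strip()
        if text:
            self.texts.append(text)


def find_gender(html, candidates=("Masculine", "Feminine")):
    """Return the first text node equal to one of the candidates, else None."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    for text in collector.texts:
        if text in candidates:
            return text
    return None


gender = find_gender(SAMPLE_HTML)
if gender is None:
    print("No match!")
else:
    print("Found a match: %s" % gender)
```

Because the lookup compares whole text nodes, a stray "Feminine" inside a script or attribute value would not count as a match, which is the main advantage over searching the raw HTML string.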

Upvotes: 1
