HoopsMcCann

Reputation: 347

Python Regular Expression Finding a Specific Text Between Headers

I'm just starting to learn about regular expressions in Python, and I've made a bit of progress on what I want to get done.

import urllib.request
import urllib.parse
import re

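response = urllib.request.urlopen("http://www.SOMEWEBSITE.com")
# Decode the bytes (assuming UTF-8); str() on bytes would give the "b'...'" repr instead
contents = response.read().decode("utf-8", errors="replace")

paragraphs = re.findall(r'<p>(.*?)</p>', contents)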

So with that regular expression I'm able to find everything between the paragraph tags, but what if I only want the paragraphs that contain a specific word? For example, all paragraphs that have the word "cat" in them. I know that (.*?) matches everything, but I'm a bit lost on how to match only paragraphs containing a specific keyword.

Anyway, thanks.

Upvotes: 1

Views: 517

Answers (1)

dbosky

Reputation: 1641

It's better to use BeautifulSoup. Example:

import urllib.request
from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

html = urllib.request.urlopen("http://www.SOMEWEBSITE.com").read()
soup = BeautifulSoup(html, "html.parser")

# now you can search the soup, e.g. soup.find_all("p")

Documentation:

BeautifulSoup Doc

But... if regex has to be used:

>>> import re
>>> text = "<p>This is some cat in a paragraph.</p>"  # avoid shadowing the built-in str
>>> re.findall(r'<p>(.*cat.*)</p>', text)
['This is some cat in a paragraph.']
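One caveat with a greedy pattern like `.*cat.*`: when several paragraphs sit on the same line, it can swallow everything from the first `<p>` to the last `</p>`, and it also matches words such as "category". A minimal sketch of a two-step alternative is below: first capture each paragraph non-greedily, then filter by keyword. The sample `html` string and the variable names are made up for illustration, and this still assumes plain `<p>` tags with no attributes or nesting.

```python
import re

# Hypothetical sample input; in practice this would be the decoded page text.
html = (
    "<p>This is some cat in a paragraph.</p>"
    "<p>No felines here.</p>"
    "<p>A cat\nacross lines.</p>"
    "<p>A category error.</p>"
)

# Step 1: capture each paragraph body; re.DOTALL lets "." match newlines too.
paragraphs = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)

# Step 2: keep only paragraphs containing "cat" as a whole word, ignoring case.
keyword = re.compile(r"\bcat\b", flags=re.IGNORECASE)
cat_paragraphs = [p for p in paragraphs if keyword.search(p)]

print(cat_paragraphs)
```

Splitting the work this way keeps each pattern simple: the first one only knows about tags, the second only about the keyword, so swapping "cat" for another word means changing a single pattern.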

Upvotes: 4
