Krayons

Reputation: 240

Python Regex Help

I am trying to sort through HTML tags and I can't seem to get it right.

Here is what I have so far:

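# Python 2
import urllib
import re

url = raw_input('Enter URL: ')
f = urllib.urlopen(url)
html = f.read()
f.close()  # parentheses are needed; a bare f.close does not call the method
# Raw string so \b is a regex word boundary, not a backspace character.
# DOTALL lets .*? span newlines; IGNORECASE matches <tag> as well as <TAG>.
r = re.compile(r'<TAG\b[^>]*>(.*?)</TAG>', re.DOTALL | re.IGNORECASE)
result = r.findall(html)
print(result)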

Here I replace "TAG" with the tag I want to see.

Thanks in advance.

Upvotes: 1

Views: 170

Answers (3)

gerry

Reputation: 1569

Here is an example using Beautiful Soup (BS):

>>> from BeautifulSoup import BeautifulSoup
>>> doc = ['<html><head><title>Page title</title></head>',
...        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
...        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
...        '</html>']
>>> soup = BeautifulSoup(''.join(doc))
>>> soup.findAll('b')
[<b>one</b>, <b>two</b>]

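As for a regular expression, you can use lookbehind and lookahead assertions so that only the text between the tags is returned:

>>> import re
>>> aa = doc[0]
>>> aa
'<html><head><title>Page title</title></head>'
>>> pt = re.compile('(?<=<title>).*?(?=</title>)')
>>> pt.findall(aa)
['Page title']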
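
If you want one pattern that works for whichever tag you plug in, something like the following might do. This is only a rough sketch in Python 3: the helper name tag_contents and the sample string are made up for illustration, and it uses a capturing group instead of lookarounds so that tags with attributes still match. The part that is easy to miss is the raw string: in an ordinary string literal '\b' is a backspace character, so the pattern compiles fine and then simply never matches.

```python
import re

def tag_contents(tag, html):
    """Return the text between each <tag ...> and the nearest </tag>.

    Hypothetical helper for illustration; it does not handle nested tags.
    """
    # Raw f-string: \b stays a regex word boundary, so 'b' does not match <body>.
    pattern = re.compile(
        rf'<{re.escape(tag)}\b[^>]*>(.*?)</{re.escape(tag)}>',
        re.DOTALL | re.IGNORECASE,
    )
    return pattern.findall(html)

sample = ('<html><head><title>Page title</title></head>'
          '<body><p id="firstpara">This is paragraph <b>one</b>.'
          '<p id="secondpara">This is paragraph <b>two</b>.</html>')

titles = tag_contents('title', sample)
bolds = tag_contents('b', sample)

# Same pattern without the r prefix: '\b' is now a backspace character.
broken = re.findall('<b\b[^>]*>(.*?)</b>', sample)

print(titles)
print(bolds)
print(broken)
```

The last call comes back empty even though the document clearly contains b tags, which is the symptom to look for when a tag pattern "finds nothing".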

Upvotes: 1

Matti Lyra

Reputation: 13088

I'm not entirely clear on what you are trying to achieve with the regex. Matching everything from an opening div tag to its closing tag, for instance, works with:

re.compile("<div.*?>.*?</div>")

However, this pattern runs into problems with nested divs, because the non-greedy match stops at the first </div> it finds, which may belong to an inner div.
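
To make the nesting problem concrete, here is a small Python 3 check. The input string is invented for the example.

```python
import re

# An outer div containing an inner div (made-up sample input).
html = '<div id="outer">A<div id="inner">B</div>C</div>'

pattern = re.compile(r'<div.*?>.*?</div>', re.DOTALL)
matches = pattern.findall(html)

# The match starts at the outer <div> but ends at the inner </div>,
# so the trailing 'C</div>' of the outer element is lost.
print(matches)
```

The single match is the outer opening tag glued to the inner closing tag, and the inner div never shows up as a match of its own. A regular expression of this kind cannot count how many divs are open, which is why a parser tends to be the safer choice for nested markup.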

Upvotes: 1

Miguel

Reputation: 51

You should really try using libraries which can perform HTML parsing out of the box. Beautiful Soup is one of my favorites.
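
If installing a third-party package is not an option, the standard library's html.parser can cover simple cases. To be clear, the sketch below is not Beautiful Soup; it is a minimal Python 3 illustration of the parser-based approach, and the names TagTextCollector and text_in_tag are invented for this example.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one tag, nested ones included."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()   # HTMLParser reports tag names in lowercase
        self.open_buffers = []   # one text buffer per currently open target tag
        self.results = []        # finished texts, in the order the tags close

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.open_buffers.append([])

    def handle_data(self, data):
        # Text belongs to every target tag that is open at this point.
        for buf in self.open_buffers:
            buf.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self.open_buffers:
            self.results.append(''.join(self.open_buffers.pop()))

def text_in_tag(tag, html):
    """Hypothetical convenience wrapper around TagTextCollector."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.results

nested = '<div id="outer">A<div id="inner">B</div>C</div>'
divs = text_in_tag('div', nested)
bolds = text_in_tag('b', '<p>This is paragraph <b>one</b>.<p>This is paragraph <b>two</b>.')

print(divs)
print(bolds)
```

Because the parser keeps track of which tags are open, the nested div yields both the inner text and the full outer text, which a single non-greedy regex does not manage. For messy real-world pages Beautiful Soup is likely still the more forgiving tool.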

Upvotes: 5
