Reputation: 240
I am trying to pull the contents of specific HTML tags out of a page with a regular expression, and I can't seem to get it right.
What I have done so far:
import urllib
import re
s = raw_input('Enter URL: ')
f = urllib.urlopen(s)
s = f.read()
f.close()
r = re.compile(r'<TAG\b[^>]*>(.*?)</TAG>')
result = re.findall(r, s)
print(result)
Here I replace "TAG" with the tag I want to see.
Thanks in advance.
Upvotes: 1
Views: 170
Reputation: 1569
Here is an example using Beautiful Soup (BS):
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
soup.findAll('b')  # returns [<b>one</b>, <b>two</b>]
As for a regular expression, you can use lookbehind and lookahead assertions so that only the text between the tags is matched:
import re
aa = doc[0]  # '<html><head><title>Page title</title></head>'
pt = re.compile('(?<=<title>).*?(?=</title>)')
re.findall(pt, aa)  # returns ['Page title']
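The lookaround pattern above is tied to one tag name. As a rough sketch of how it could be generalized, the hypothetical helper `tag_contents` below builds a pattern for any tag. It uses a capture group instead of lookarounds, a raw string so that `\b` stays a word boundary, and the `re.IGNORECASE` and `re.DOTALL` flags so that upper-case tags and multi-line contents are matched. The sample string is made up for illustration.

```python
import re

def tag_contents(html, tag):
    """Return the text between each <tag ...> and the next </tag> (hypothetical helper)."""
    name = re.escape(tag)
    pattern = re.compile(
        # Raw string: \b is a word boundary here, not a backspace character.
        r'<%s\b[^>]*>(.*?)</%s\s*>' % (name, name),
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.findall(html)

# Made-up sample input, not taken from a real page.
sample = ('<html><head><title>Page title</title></head>'
          '<body><p id="a">one</p>\n<P>two\nlines</P></body></html>')

titles = tag_contents(sample, 'title')
paragraphs = tag_contents(sample, 'p')
print(titles)
print(paragraphs)
```

This still treats HTML as flat text, so it shares the usual regex limitations, such as nested tags of the same name.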
Upvotes: 1
Reputation: 13088
I'm not entirely clear on what you are trying to achieve with the regex. Matching a div element together with its contents, for instance, works with:
re.compile("<div.*?>.*?</div>")
However, this pattern runs into problems with nested divs: the non-greedy match stops at the first closing </div>, so an outer div is cut off at the inner div's closing tag.
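To make the nested-div problem concrete, here is a small sketch that runs the same pattern on a made-up string containing one div inside another:

```python
import re

pattern = re.compile(r"<div.*?>.*?</div>")

# Made-up input: an outer div that contains an inner div.
html = "<div>outer <div>inner</div> tail</div>"

matches = pattern.findall(html)
print(matches)
```

The only match is `<div>outer <div>inner</div>`: it starts at the outer opening tag but ends at the inner closing tag, and the remaining ` tail</div>` is never matched.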
Upvotes: 1
Reputation: 51
You should really try using libraries which can perform HTML parsing out of the box. Beautiful Soup is one of my favorites.
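As a rough illustration of the parser-based approach, here is a minimal sketch. Note that it uses the standard library's `html.parser` module rather than Beautiful Soup, purely so that it runs without installing anything. The class name `TagTextCollector` and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser  # Python 3; on Python 2 the module is named HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text inside every outermost occurrence of one tag (hypothetical class)."""

    def __init__(self, tag):
        HTMLParser.__init__(self)
        self.tag = tag.lower()  # the parser reports tag names in lower case
        self.depth = 0          # how many matching tags are currently open
        self.results = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self._buf = []  # start a new outermost element
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(''.join(self._buf))

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)

collector = TagTextCollector('b')
collector.feed('<p>This is paragraph <b>one</b>.</p><p>This is paragraph <b>two</b>.</p>')
collector.close()
print(collector.results)
```

Because the parser tracks opening and closing tags, counting depth handles nested tags of the same name, which a single regular expression cannot do reliably.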
Upvotes: 5