Reputation: 103
I have an HTML page that looks like this:
<tr>
<td align=left>
<a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
<img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
</a>
<a href="history/2c0b65635b3ac68a4d53b89521216d26_0.html" title="C.">Th</a>
</td>
</tr>
<tr align=right>
<td align=left>
<a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html">
<img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
</a>
<a href="marketing/3c0a65635b2bc68b5c43b88421306c37_0.html" title="b">aa</a>
</td>
</tr>
I need to extract the following text:
history/2c0b65635b3ac68a4d53b89521216d26.html marketing/3c0a65635b2bc68b5c43b88421306c37.html
I wrote a Python script that uses regular expressions:
import re
a = re.compile("[0-9 a-z]{0,15}/[0-9 a-f]{32}.html")
print(a.match(s))
where s holds the HTML page above. However, when I run this script I get None. Where did I go wrong?
Upvotes: 0
Views: 348
Reputation: 474191
Don't use regex for parsing HTML content.
Use a specialized tool - an HTML Parser.
Example (using BeautifulSoup):
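# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""Your HTML here"""
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(data, "html.parser")

# Select every <a> that sits inside a <td> and has an href attribute
for link in soup.select('td a[href]'):
    print(link['href'])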
Prints:
history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html
Or, if you want to get only the href values that follow a pattern, use:
import re

# Keep only the <a> tags whose href matches the pattern
for link in soup.find_all('a', href=re.compile(r'\w+/\w{32}\.html')):
    print(link['href'])
where r'\w+/\w{32}\.html' is a regular expression that is applied to the href attribute of every a tag found. It matches one or more word characters (\w+, meaning letters, digits, or underscores), followed by a slash, followed by exactly 32 word characters (\w{32}), followed by a literal dot (\., which needs to be escaped), followed by html.
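To see what the pattern keeps and what it drops, here is a minimal sketch that uses only the standard re module on the four href values from the question. It assumes the filter behaves like pattern.search() on each href, which is how the Beautiful Soup documentation describes regular-expression filters. The names hrefs, matching, and page are made up for this illustration. The last lines also show why the original attempt printed None: match() only tries the pattern at the very start of the string, while search() scans the whole string.

```python
import re

# The four href values from the sample page in the question.
hrefs = [
    "history/2c0b65635b3ac68a4d53b89521216d26.html",
    "history/2c0b65635b3ac68a4d53b89521216d26_0.html",
    "marketing/3c0a65635b2bc68b5c43b88421306c37.html",
    "marketing/3c0a65635b2bc68b5c43b88421306c37_0.html",
]

pattern = re.compile(r'\w+/\w{32}\.html')

# search() looks for the pattern anywhere in the string. The "_0" variants
# are dropped because the 32 word characters after the slash must be
# followed directly by ".html", and there they are followed by "_".
matching = [h for h in hrefs if pattern.search(h)]
print(matching)

# A fragment of the page (illustrative): it starts with "<", so match(),
# which only tries position 0, finds nothing.
page = '<tr>\n<td align=left>\n<a href="' + hrefs[0] + '">'
print(pattern.match(page))           # None
print(pattern.search(page).group())  # the first href
```

So if you do stay with plain regular expressions on the whole page, search() or findall() is the call to reach for, not match().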
Upvotes: 3
Reputation: 26677
You can also write something like:
>>> soup = BeautifulSoup(html)  # html is the string containing the data to be parsed
>>> for a in soup.select('a'):
...     print(a['href'])
...
history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html
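If installing BeautifulSoup is not an option, roughly the same loop can be sketched with the standard library's html.parser module (Python 3). This is a swapped-in technique, not what the answers above use. The class name HrefCollector and the shortened html sample are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Shortened copy of the first table row from the question.
html = """<tr>
<td align=left>
<a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
<img src="/images/page.gif" border="0" width=20 height=20>
</a>
<a href="history/2c0b65635b3ac68a4d53b89521216d26_0.html" title="C.">Th</a>
</td>
</tr>"""

collector = HrefCollector()
collector.feed(html)
for href in collector.hrefs:
    print(href)
```

Unlike a['href'], which raises KeyError on an anchor without that attribute, this version just skips such tags.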
Upvotes: 2