Reputation: 512
I have an HTML file with the following data structure:
<tr>
<td valign="top"><img src="img.jpg"></td>
<td><a href="file.zip">file.zip</a></td>
<td align="right">24-Apr-2013 12:42 </td>
<td align="right">200K</td>
</tr>
...
It's basically a simple table and when viewed in Firefox it looks like this:
file.zip 24-Apr-2013 12:42 200K
I want to extract these three values (file name, date, size). I could do it with split(), for example,
but I am wondering whether it is possible to print the "HTML-interpreted form" of this in Python:
import xyz
print xyz.htmlinterpreted("htmlfile.html")
>>> file.zip 24-Apr-2013 12:42 200K
That way I could easily split the data with split(" "). Is this possible in Python?
Upvotes: 0
Views: 95
Reputation: 1125058
Use an HTML parser. BeautifulSoup makes this a breeze:
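from bs4 import BeautifulSoup
# html_source holds the HTML text; naming the parser avoids bs4's "no parser specified" warning
soup = BeautifulSoup(html_source, "html.parser")
# stripped_strings yields every text node with surrounding whitespace removed
print(list(soup.stripped_strings))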
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''<tr><td valign="top"><img src="img.jpg"></td><td><a href="file.zip">file.zip</a></td><td align="right">24-Apr-2013 12:42 </td><td align="right">200K</td></tr>''')
>>> print list(soup.stripped_strings)
[u'file.zip', u'24-Apr-2013 12:42', u'200K']
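Note that stripped_strings over the whole document returns one flat list, so with several rows you would still have to regroup the values in threes. If installing BeautifulSoup is not an option, a rough sketch with the standard library's html.parser (Python 3) can collect the text one row at a time instead. This is a different technique from the one above, and the names RowTextParser, extract_rows and sample are made up for illustration; it also assumes every file sits in its own <tr> and that the rows are not nested.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the non-empty text fragments of each <tr> as one list."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of strings per table row
        self._current = None  # row being filled, or None outside a <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

    def handle_data(self, data):
        # Keep only text that appears inside a row and is not pure whitespace.
        text = data.strip()
        if text and self._current is not None:
            self._current.append(text)


def extract_rows(html_source):
    """Return a list of rows, each a list of the stripped strings in that row."""
    parser = RowTextParser()
    parser.feed(html_source)
    parser.close()
    return parser.rows


# Sample input: the row from the question, wrapped in a <table> (an assumption).
sample = """
<table>
<tr>
<td valign="top"><img src="img.jpg"></td>
<td><a href="file.zip">file.zip</a></td>
<td align="right">24-Apr-2013 12:42 </td>
<td align="right">200K</td>
</tr>
</table>
"""

for name, date, size in extract_rows(sample):
    print(name, date, size)
```

Each row comes back as its own list, so the date stays in one piece ('24-Apr-2013 12:42') rather than being cut at the space the way split(" ") on the rendered line would cut it. The three-way unpacking in the loop only works when every row yields exactly three strings.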
Upvotes: 1