Reputation: 283
So I've got a table:
<table border="1" style="width: 100%">
<caption></caption>
<col>
<col>
<tbody>
<tr>
<td>Pig</td>
<td>House Type</td>
</tr>
<tr>
<td>Pig A</td>
<td>Straw</td>
</tr>
<tr>
<td>Pig B</td>
<td>Stick</td>
</tr>
<tr>
<td>Pig C</td>
<td>Brick</td>
</tr>
</tbody>
</table>
And I was simply trying to return a JSON string of the table pairs like so:
[["Pig A", "Straw"], ["Pig B", "Stick"], ["Pig C", "Brick"]]
However, with my code I can't seem to get rid of the HTML tags:
stable = soup.find('table')
cells = []
rows = stable.findAll('tr')
for tr in rows[1:4]:
    # Process the body of the table
    row = []
    td = tr.findAll('td')
    #td = [el.text for el in soup.tr.finall('td')]
    row.append(td[0])
    row.append(td[1])
    cells.append(row)
return cells
# Eventually, I'd like to do this:
# h = json.dumps(cells)
# return h
My output is this:
[[<td>Pig A</td>, <td>Straw</td>], [<td>Pig B</td>, <td>Stick</td>], [<td>Pig C</td>, <td>Brick</td>]]
Upvotes: 1
Views: 362
Reputation: 5275
You can try using the lxml library.
import lxml.html as PARSER

# data = open('example.html').read()  # Alternatively, read the HTML from a file.
data = """
<table border="1" style="width: 100%">
<caption></caption>
<col>
<col>
<tbody>
<tr>
<td>Pig</td>
<td>House Type</td>
</tr>
<tr>
<td>Pig A</td>
<td>Straw</td>
</tr>
<tr>
<td>Pig B</td>
<td>Stick</td>
</tr>
<tr>
<td>Pig C</td>
<td>Brick</td>
</tr>
</tbody>
</table>
"""
root = PARSER.fromstring(data)
main_list = []
# Collect the stripped text of every cell, one list per table row.
for tr in root.iter('tr'):
    main_list.append([td.text_content().strip() for td in tr.iter('td')])
print(main_list)
Output: [['Pig', 'House Type'], ['Pig A', 'Straw'], ['Pig B', 'Stick'], ['Pig C', 'Brick']]
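To get from that list to the JSON string the question asks for, something like the following should work. It is only a sketch: main_list is hard-coded here in the shape printed above (with stray whitespace left in on purpose), and the names pairs and json_pairs are made up for illustration.

```python
import json

# Sample rows in the shape produced above; the leading spaces are deliberate
# to show that strip() cleans them up.
main_list = [['Pig', ' House Type'], ['Pig A', ' Straw'],
             ['Pig B', ' Stick'], ['Pig C', ' Brick']]

# Drop the header row and trim each cell before serialising.
pairs = [[cell.strip() for cell in row] for row in main_list[1:]]
json_pairs = json.dumps(pairs)
print(json_pairs)
```

Slicing with main_list[1:] simply skips the first row, so it assumes the header is always the first tr in the table.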
Upvotes: 0
Reputation: 13260
Use the text property to get only the inner text of the element:
row.append(td[0].text)
row.append(td[1].text)
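Putting it together, a minimal self-contained sketch might look like the one below. The function name table_pairs and the inline html sample are assumptions made for illustration, and it uses get_text(strip=True) rather than plain .text so that surrounding whitespace is trimmed as well; either should work for this table.

```python
import json

from bs4 import BeautifulSoup

html = """
<table border="1" style="width: 100%">
<tbody>
<tr><td>Pig</td><td>House Type</td></tr>
<tr><td>Pig A</td><td>Straw</td></tr>
<tr><td>Pig B</td><td>Stick</td></tr>
<tr><td>Pig C</td><td>Brick</td></tr>
</tbody>
</table>
"""


def table_pairs(markup):
    """Return the body rows of the first table as a JSON string."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = soup.find("table").find_all("tr")
    # Skip the header row, then keep only the stripped text of each cell.
    cells = [[td.get_text(strip=True) for td in tr.find_all("td")]
             for tr in rows[1:]]
    return json.dumps(cells)


print(table_pairs(html))
```

Because the list now holds plain strings instead of Tag objects, json.dumps can serialise it directly.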
Upvotes: 2