Reputation: 9348
I am parsing a table in a saved .html document. The HTML looks like this:
<table id="detailBody" width="100%" cellspacing="0" cellpadding="0" border="0" class="tab2" style="display: block;"><tbody>
<tr><td><ul><li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">↑</span></li><li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">↑</span></li><li><span>14:56:52</span><span class="red">11.750</span><span class="red">479</span><span class="fr red">↑</span></li><li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">↓</span></li><li><span>14:56:46</span><span class="">11.740</span><span class="green">333</span><span class="fr green">↓</span></li><li><span>14:56:43</span><span class="">11.740</span><span class="green">21</span><span class="fr green">↓</span></li><li><span>14:56:40</span><span class="">11.740</span><span class="green">15</span><span class="fr green">↓</span></li><li><span>14:56:37</span><span class="">11.740</span><span class="green">35</span><span class="fr green">↓</span></li><li><span>14:56:34</span><span class="red">11.750</span><span class="red">11</span><span class="fr red">↑</span></li><li><span>14:56:31</span><span class="">11.740</span><span class="green">3</span><span class="fr green">↓</span></li><li><span>14:56:28</span><span class="">11.740</span><span class="green">24</span><span class="fr green">↓</span></li><li><span>14:56:22</span><span class="red">11.750</span><span class="red">291</span><span class="fr red">↑</span></li><li><span>14:56:19</span><span class="">11.740</span><span class="red">198</span><span class="fr red">↑</span></li><li><span>14:56:16</span><span class="green">11.730</span><span class="green">15</span><span class="fr green">↓</span></li></ul></td></tr>
</tbody></table>
What I have so far is:
list_a = soup.find_all('table')[0].tbody.find_all("tr")
for a in list_a:
    for b in a:
        for c in b:
            for d in c:
                for e in d:
                    print e.renderContents()
Even though it doesn't look very nice, the result is:
15:00:19
11.750
5392
↑
14:56:55
11.750
17
↑
14:56:52
11.750
479
↑
However, the table contains too many rows. I only want the first 10 groups of data, and from each group only the 3rd and 4th items, collected into two lists, i.e.
["5392", "17", "479", …]
and
["↑", "↑", "↑", …]  # the "↑" can be replaced with some other consistent marker if it is a problem
How can I achieve that? Thanks.
Upvotes: 0
Views: 83
Reputation: 46759
The following will extract your two columns using the span tags inside the li elements:
html = """
<table id="detailBody" width="100%" cellspacing="0" cellpadding="0" border="0" class="tab2" style="display: block;">
<tbody>
<tr>
<td>
<ul>
<li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">?</span></li>
<li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">?</span></li>
<li><span>14:56:52</span><span class="red">11.750</span><span class="red">479</span><span class="fr red">?</span></li>
<li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">?</span></li>
<li><span>14:56:46</span><span class="">11.740</span><span class="green">333</span><span class="fr green">?</span></li>
<li><span>14:56:43</span><span class="">11.740</span><span class="green">21</span><span class="fr green">?</span></li>
<li><span>14:56:40</span><span class="">11.740</span><span class="green">15</span><span class="fr green">?</span></li>
<li><span>14:56:37</span><span class="">11.740</span><span class="green">35</span><span class="fr green">?</span></li>
<li><span>14:56:34</span><span class="red">11.750</span><span class="red">11</span><span class="fr red">?</span></li>
<li><span>14:56:31</span><span class="">11.740</span><span class="green">3</span><span class="fr green">?</span></li>
<li><span>14:56:28</span><span class="">11.740</span><span class="green">24</span><span class="fr green">?</span></li>
<li><span>14:56:22</span><span class="red">11.750</span><span class="red">291</span><span class="fr red">?</span></li>
<li><span>14:56:19</span><span class="">11.740</span><span class="red">198</span><span class="fr red">?</span></li>
<li><span>14:56:16</span><span class="green">11.730</span><span class="green">15</span><span class="fr green">?</span></li>
</ul>
</td>
</tr>
</tbody></table>"""
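from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
col_3 = []
col_4 = []
# Each li is one row; its 3rd and 4th spans are the two columns we want.
for li in soup.find_all('table')[0].find_all("li"):
    cols = li.find_all("span")
    col_3.append(cols[2].text)
    col_4.append(cols[3].text)
print col_3
print col_4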
This would give you the following output:
[u'5392', u'17', u'479', u'6', u'333', u'21', u'15', u'35', u'11', u'3', u'24', u'291', u'198', u'15']
[u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?']
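The question also asks for only the first 10 groups. A minimal sketch of one way to get that, assuming the bs4 package is installed: find_all returns a list-like result, so it can be sliced before looping. The html string below is a shortened stand-in for the table in the question, and limit is an illustrative name (set to 2 here so the effect is visible on three rows).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumption: same structure).
html = """
<table id="detailBody"><tbody><tr><td><ul>
<li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">↑</span></li>
<li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">↑</span></li>
<li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">↓</span></li>
</ul></td></tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
limit = 2  # use 10 for the real table

col_3 = []
col_4 = []
# Slice the list of li elements so only the first `limit` rows are visited.
for li in soup.find("table", id="detailBody").find_all("li")[:limit]:
    cols = li.find_all("span")
    col_3.append(cols[2].get_text())
    col_4.append(cols[3].get_text())

print(col_3)
print(col_4)
```

With the full table, setting limit to 10 should keep the first ten li rows and ignore the rest.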
Upvotes: 1
Reputation: 143
Why not find all the span items directly, since that is what you actually want? So instead of
list_a = soup.find_all('table')[0].tbody.find_all("tr")
try
list_a = soup.find_all('table')[0].tbody.find_all("tr")[0].find_all("span")
I don't know whether your table has only one row. If it does, this should work and give you all the spans, and you can just skip the ones you do not need. If you have multiple rows, you have to iterate over the rows like this
list_a = soup.find_all('table')[0].tbody.find_all("tr")
for a in list_a:
    a.find_all("span")
and again you will get all the span items. I hope this points you in the right direction!
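If you go the flat-list route, skipping the spans you do not need can be done with stride slicing. This is only a sketch, assuming every li holds exactly four spans in the same order; span_texts is a hypothetical stand-in for the text of each span in list_a, and group_size and max_groups are illustrative names.

```python
# Hypothetical stand-in for the text of every span, in document order:
# time, price, volume, direction, repeated for each li.
span_texts = [
    "15:00:19", "11.750", "5392", "↑",
    "14:56:55", "11.750", "17", "↑",
    "14:56:52", "11.750", "479", "↑",
    "14:56:49", "11.740", "6", "↓",
]

group_size = 4  # assumption: four spans per li
max_groups = 3  # use 10 for the real table

# Start at the 3rd (index 2) or 4th (index 3) item and step by the group size,
# then keep only the first max_groups values.
volumes = span_texts[2::group_size][:max_groups]
directions = span_texts[3::group_size][:max_groups]

print(volumes)
print(directions)
```

This breaks if any li has a different number of spans, so looping over the li elements is the safer choice when the markup is not uniform.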
Upvotes: 2