Mark K

Reputation: 9348

Python, using BeautifulSoup parsing values from a table

I am parsing a table in a saved .html document. The HTML looks like this:

<table id="detailBody" width="100%" cellspacing="0" cellpadding="0" border="0" class="tab2" style="display: block;"><tbody>
                                        <tr><td><ul><li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">↑</span></li><li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">↑</span></li><li><span>14:56:52</span><span class="red">11.750</span><span class="red">479</span><span class="fr red">↑</span></li><li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">↓</span></li><li><span>14:56:46</span><span class="">11.740</span><span class="green">333</span><span class="fr green">↓</span></li><li><span>14:56:43</span><span class="">11.740</span><span class="green">21</span><span class="fr green">↓</span></li><li><span>14:56:40</span><span class="">11.740</span><span class="green">15</span><span class="fr green">↓</span></li><li><span>14:56:37</span><span class="">11.740</span><span class="green">35</span><span class="fr green">↓</span></li><li><span>14:56:34</span><span class="red">11.750</span><span class="red">11</span><span class="fr red">↑</span></li><li><span>14:56:31</span><span class="">11.740</span><span class="green">3</span><span class="fr green">↓</span></li><li><span>14:56:28</span><span class="">11.740</span><span class="green">24</span><span class="fr green">↓</span></li><li><span>14:56:22</span><span class="red">11.750</span><span class="red">291</span><span class="fr red">↑</span></li><li><span>14:56:19</span><span class="">11.740</span><span class="red">198</span><span class="fr red">↑</span></li><li><span>14:56:16</span><span class="green">11.730</span><span class="green">15</span><span class="fr green">↓</span></li></ul></td></tr>
                                    </tbody></table>

What I have so far is:

list_a = soup.find_all('table')[0].tbody.find_all("tr")

for a in list_a:
    for b in a:
        for c in b:
            for d in c:
                for e in d:
                    print e.renderContents()

Even though the code doesn't look very nice, it produces output like this:

15:00:19
11.750
5392
↑
14:56:55
11.750
17
↑
14:56:52
11.750
479
↑

However, the table has too many entries. I only want the first 10 groups of data, with the 3rd and 4th items of each group put into two separate lists, i.e.

["5392", "17", "479", …]

and

["↑", "↑", "↑", …]  # the "↑" can be replaced with some other consistent marker if it causes problems

How can I achieve that? Thanks.

Upvotes: 0

Views: 83

Answers (2)

Martin Evans

Reputation: 46759

The following will extract your two columns using the span tag inside the li elements:

html = """
<table id="detailBody" width="100%" cellspacing="0" cellpadding="0" border="0" class="tab2" style="display: block;">
<tbody>
<tr>
    <td>
    <ul>
    <li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">?</span></li>
    <li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">?</span></li>
    <li><span>14:56:52</span><span class="red">11.750</span><span class="red">479</span><span class="fr red">?</span></li>
    <li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">?</span></li>
    <li><span>14:56:46</span><span class="">11.740</span><span class="green">333</span><span class="fr green">?</span></li>
    <li><span>14:56:43</span><span class="">11.740</span><span class="green">21</span><span class="fr green">?</span></li>
    <li><span>14:56:40</span><span class="">11.740</span><span class="green">15</span><span class="fr green">?</span></li>
    <li><span>14:56:37</span><span class="">11.740</span><span class="green">35</span><span class="fr green">?</span></li>
    <li><span>14:56:34</span><span class="red">11.750</span><span class="red">11</span><span class="fr red">?</span></li>
    <li><span>14:56:31</span><span class="">11.740</span><span class="green">3</span><span class="fr green">?</span></li>
    <li><span>14:56:28</span><span class="">11.740</span><span class="green">24</span><span class="fr green">?</span></li>
    <li><span>14:56:22</span><span class="red">11.750</span><span class="red">291</span><span class="fr red">?</span></li>
    <li><span>14:56:19</span><span class="">11.740</span><span class="red">198</span><span class="fr red">?</span></li>
    <li><span>14:56:16</span><span class="green">11.730</span><span class="green">15</span><span class="fr green">?</span></li>
    </ul>
    </td>
</tr>
</tbody></table>"""

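from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")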

col_3 = []
col_4 = []

for li in soup.find_all('table')[0].find_all("li"):
    cols = li.find_all("span")
    col_3.append(cols[2].text)
    col_4.append(cols[3].text)

print col_3 
print col_4

This would give you the following output:

[u'5392', u'17', u'479', u'6', u'333', u'21', u'15', u'35', u'11', u'3', u'24', u'291', u'198', u'15']
[u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?']
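The question also asked for only the first 10 groups. A minimal sketch of one way to cap the loop, assuming Python 3 and the `bs4` package, is shown below. The helper name `first_n_columns` and the shortened sample markup are made up for illustration; `limit` is the `find_all` argument that stops the search after that many matches.

```python
from bs4 import BeautifulSoup

# Shortened sample of the table from the question (illustrative only)
html = """
<table id="detailBody"><tbody><tr><td><ul>
<li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">↑</span></li>
<li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">↑</span></li>
<li><span>14:56:52</span><span class="red">11.750</span><span class="red">479</span><span class="fr red">↑</span></li>
<li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">↓</span></li>
</ul></td></tr></tbody></table>
"""


def first_n_columns(markup, n=10):
    """Hypothetical helper: 3rd and 4th span text of the first n <li> items."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", id="detailBody")
    col_3, col_4 = [], []
    # limit=n makes find_all stop after the first n <li> elements
    for li in table.find_all("li", limit=n):
        spans = li.find_all("span")
        if len(spans) < 4:
            # Skip malformed groups, so fewer than n results are possible
            continue
        col_3.append(spans[2].get_text(strip=True))
        col_4.append(spans[3].get_text(strip=True))
    return col_3, col_4


# n=3 here only because the sample is short; the default is 10
volumes, directions = first_n_columns(html, n=3)
print(volumes)
print(directions)
```

With the full table from the question, calling `first_n_columns(html)` with the default `n=10` would return at most ten entries per list.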

Upvotes: 1

nablahero

Reputation: 143

Why not find all the span items directly, since those are what you actually want? So instead of

list_a = soup.find_all('table')[0].tbody.find_all("tr")

try

list_a = soup.find_all('table')[0].tbody.find_all("tr")[0].find_all("span")

I don't know whether your table has only one row. If it does, this should work and give you all the spans, and you can simply skip the ones you do not need. If you have multiple rows, you have to iterate over the rows like this

list_a = soup.find_all('table')[0].tbody.find_all("tr")
for a in list_a:
    spans = a.find_all("span")  # all span elements in this row

and again you will get all span items. I hope this leads you in the right direction!
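To make the "skip the ones you do not need" step concrete, here is a rough sketch in Python 3 with `bs4`. It assumes every group consists of exactly four spans in a fixed order, which holds for the sample in the question but is not guaranteed in general. The constants `GROUP` and `MAX_GROUPS` and the reduced sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Reduced sample with two rows (illustrative only)
html = """<table><tbody>
<tr><td><ul>
<li><span>15:00:19</span><span>11.750</span><span>5392</span><span>↑</span></li>
<li><span>14:56:55</span><span>11.750</span><span>17</span><span>↑</span></li>
</ul></td></tr>
<tr><td><ul>
<li><span>14:56:49</span><span>11.740</span><span>6</span><span>↓</span></li>
</ul></td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every span, row by row, into one flat list
texts = []
for row in soup.find("table").find_all("tr"):
    texts.extend(span.get_text(strip=True) for span in row.find_all("span"))

GROUP = 4        # assumption: time, price, volume, direction per group
MAX_GROUPS = 10  # keep only the first 10 groups, as the question asks

texts = texts[:GROUP * MAX_GROUPS]
col_3 = texts[2::GROUP]  # every 4th item starting at the 3rd
col_4 = texts[3::GROUP]  # every 4th item starting at the 4th

print(col_3)
print(col_4)
```

Stepped slicing keeps the code short, but it silently misaligns the columns if any group has a missing or extra span, so iterating over the `li` elements is the safer choice when the markup is less regular.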

Upvotes: 2
