A.J

Reputation: 726

Parsing inner td tag with BeautifulSoup4 in Python

I am working on a small project, and I am having a hard time parsing the rows I need from some HTML using bs4.

HTML:

 <div id="results_box">
 <table class="genTbl closedTbl historicalTbl" id="curr_table">
    <thead>
        <tr>
            <th class="first left noWrap">Date</th>
            <th class="noWrap">Price</th>
            <th class="noWrap">Open</th>
            <th class="noWrap">High</th>
            <th class="noWrap">Low</th>
            <th class="noWrap">Vol.</th>
            <th class="noWrap">Change %</th>
        </tr>
    </thead>
    <tbody>
            <tr>
            <td class="first left bold noWrap">Jul 15, 2016</td>
            <td class="redFont">98.78</td>
            <td>99.02</td>
            <td>99.30</td>
            <td>98.51</td>
            <td>30.14M</td>
            <td class="bold redFont">-0.01%</td>
        </tr>
                <tr>
            <td class="first left bold noWrap">Jul 14, 2016</td>
            <td class="greenFont">98.79</td>
            <td>97.39</td>
            <td>98.99</td>
            <td>97.32</td>
            <td>38.92M</td>
            <td class="bold greenFont">1.98%</td>
         </tr> 

I need to extract -0.01% and 1.98% from these two lines:

<td class="bold redFont">-0.01%</td>
<td class="bold greenFont">1.98%</td>

I used

import re

# parsed_html is the BeautifulSoup object built from the HTML above
L = []
txt = parsed_html.find("table", {"id": "curr_table"}).find_all("td", {"class": re.compile('bold .*Font')})
for row in txt:
    L.append(row.text)
print(L)

but I am getting an empty list. Any solutions or other suggestions?

Upvotes: 3

Views: 1034

Answers (1)

alecxe

Reputation: 474281

The reason your current approach does not work is that class is a special multi-valued attribute in BeautifulSoup: a regular expression is not applied to the complete attribute value, but to each individual class instead. A related thread explains this in more detail.
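To make the multi-valued behaviour concrete, here is a minimal sketch on a trimmed-down copy of the table. It assumes the built-in html.parser backend, and is_change_cell is a hypothetical helper name introduced only for this illustration. It shows that class comes back as a list of separate values, and one possible way to match on several of those values at once with a function filter:

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the table from the question (illustration only)
html = """
<table id="curr_table">
  <tr>
    <td class="first left bold noWrap">Jul 15, 2016</td>
    <td class="redFont">98.78</td>
    <td class="bold redFont">-0.01%</td>
  </tr>
  <tr>
    <td class="first left bold noWrap">Jul 14, 2016</td>
    <td class="greenFont">98.79</td>
    <td class="bold greenFont">1.98%</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "curr_table"})

# class is stored as a list of separate values, not as one string
first_change = table.find_all("td")[2]
print(first_change["class"])  # ['bold', 'redFont']


def is_change_cell(tag):
    """Hypothetical helper: a td that has 'bold' plus some '...Font' class."""
    classes = tag.get("class", [])
    return (
        tag.name == "td"
        and "bold" in classes
        and any(c.endswith("Font") for c in classes)
    )


changes = [td.get_text() for td in table.find_all(is_change_cell)]
print(changes)  # ['-0.01%', '1.98%']
```

A function filter receives the whole tag, so it can inspect every class value together instead of one at a time.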

You can actually avoid checking class values and, instead, just grab the td elements having % at the end of the text:

table = parsed_html.find("table", {"id":"curr_table"})
for td in table.find_all("td", text=lambda text: text and text.endswith('%')):
    print(td.get_text())
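For reference, a self-contained variant of the same idea might look like the sketch below. It assumes Beautiful Soup 4.4.0 or later, where this argument is named string (text is the older name for it), and it uses a shortened, made-up copy of one table row with the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (illustration only)
html = """
<table id="curr_table">
  <tr>
    <td class="first left bold noWrap">Jul 15, 2016</td>
    <td>30.14M</td>
    <td class="bold redFont">-0.01%</td>
  </tr>
  <tr>
    <td class="first left bold noWrap">Jul 14, 2016</td>
    <td>38.92M</td>
    <td class="bold greenFont">1.98%</td>
  </tr>
</table>
"""

parsed_html = BeautifulSoup(html, "html.parser")
table = parsed_html.find("table", {"id": "curr_table"})

# The function receives each td's string (or None), so guard against None
percent_cells = table.find_all(
    "td", string=lambda s: s is not None and s.strip().endswith("%")
)
values = [td.get_text() for td in percent_cells]
print(values)  # ['-0.01%', '1.98%']
```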

I would actually use pandas to parse this well-formatted table into a DataFrame, which is quite convenient to work with. The pandas documentation is extensive and explains how to work with a DataFrame:

import pandas as pd

data = """
 <table class="genTbl closedTbl historicalTbl" id="curr_table">
    <thead>
        <tr>
            <th class="first left noWrap">Date</th>
            <th class="noWrap">Price</th>
            <th class="noWrap">Open</th>
            <th class="noWrap">High</th>
            <th class="noWrap">Low</th>
            <th class="noWrap">Vol.</th>
            <th class="noWrap">Change %</th>
        </tr>
    </thead>
    <tbody>
            <tr>
            <td class="first left bold noWrap">Jul 15, 2016</td>
            <td class="redFont">98.78</td>
            <td>99.02</td>
            <td>99.30</td>
            <td>98.51</td>
            <td>30.14M</td>
            <td class="bold redFont">-0.01%</td>
        </tr>
                <tr>
            <td class="first left bold noWrap">Jul 14, 2016</td>
            <td class="greenFont">98.79</td>
            <td>97.39</td>
            <td>98.99</td>
            <td>97.32</td>
            <td>38.92M</td>
            <td class="bold greenFont">1.98%</td>
         </tr>
    </tbody>
</table>
"""

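# Recent pandas versions deprecate passing literal HTML, so wrap it in StringIO
from io import StringIO

df = pd.read_html(StringIO(data))[0]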
print(df)

print("----")
print(df['Change %'].tolist())

Prints:

           Date  Price   Open   High    Low    Vol. Change %
0  Jul 15, 2016  98.78  99.02  99.30  98.51  30.14M   -0.01%
1  Jul 14, 2016  98.79  97.39  98.99  97.32  38.92M    1.98%
----
['-0.01%', '1.98%']
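If numbers are needed rather than strings, one possible follow-up step is to strip the percent sign and cast the column to float. This is only a sketch: the DataFrame below is rebuilt by hand from the printed output above so that the snippet runs on its own, and the variable name change is an arbitrary choice.

```python
import pandas as pd

# Hand-built stand-in for the parsed table, using the values printed above
df = pd.DataFrame({
    "Date": ["Jul 15, 2016", "Jul 14, 2016"],
    "Change %": ["-0.01%", "1.98%"],
})

# Drop the trailing percent sign, then convert the strings to floats
change = df["Change %"].str.rstrip("%").astype(float)
print(change.tolist())  # [-0.01, 1.98]
```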

Upvotes: 2
