Reputation: 726
I am working on a small project and am having a hard time extracting the rows I need from some HTML using bs4.
HTML:
<div id="results_box">
<table class="genTbl closedTbl historicalTbl" id="curr_table">
<thead>
<tr>
<th class="first left noWrap">Date</th>
<th class="noWrap">Price</th>
<th class="noWrap">Open</th>
<th class="noWrap">High</th>
<th class="noWrap">Low</th>
<th class="noWrap">Vol.</th> <th class="noWrap">Change %</th>
</tr>
</thead>
<tbody>
<tr>
<td class="first left bold noWrap">Jul 15, 2016</td>
<td class="redFont">98.78</td>
<td>99.02</td>
<td>99.30</td>
<td>98.51</td>
<td>30.14M</td> <td class="bold redFont">-0.01%</td>
</tr>
<tr>
<td class="first left bold noWrap">Jul 14, 2016</td>
<td class="greenFont">98.79</td>
<td>97.39</td>
<td>98.99</td>
<td>97.32</td>
<td>38.92M</td> <td class="bold greenFont">1.98%</td>
</tr>
I need to extract -0.01% and 1.98% from these two lines:
<td class="bold redFont">-0.01%</td>
<td class="bold greenFont">1.98%</td>
I used:
import re
L = []
txt = parsed_html.find("table", {"id": "curr_table"}).find_all("td", {"class": re.compile('bold .*Font')})
for row in txt:
    L.append(row.text)
print(L)
but I am getting an empty list. Any solutions or other suggestions?
Upvotes: 3
Views: 1034
Reputation: 474281
Your current approach does not work because class is a special multi-valued attribute in BeautifulSoup: a regular expression is applied to each individual class value rather than to the complete attribute string. This thread should explain it in more detail:
You can actually avoid checking class values and instead just grab the td elements whose text ends with %:
table = parsed_html.find("table", {"id": "curr_table"})
# string= is the current name for the older text= argument
for td in table.find_all("td", string=lambda text: text and text.endswith('%')):
    print(td.get_text())
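If you would rather keep filtering on the classes, one option is to pass a function to find_all and inspect the class list yourself. The following is only a minimal sketch: the HTML is a trimmed copy of the table from the question, and is_change_cell is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question.
html = """
<table id="curr_table">
<tbody>
<tr>
<td class="first left bold noWrap">Jul 15, 2016</td>
<td class="redFont">98.78</td>
<td>30.14M</td> <td class="bold redFont">-0.01%</td>
</tr>
<tr>
<td class="first left bold noWrap">Jul 14, 2016</td>
<td class="greenFont">98.79</td>
<td>38.92M</td> <td class="bold greenFont">1.98%</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "curr_table"})

# The class attribute is exposed as a list of class names, not one string.
first_cell = table.find("td")
print(first_cell["class"])


def is_change_cell(tag):
    """Hypothetical helper: True for <td> tags that have the class 'bold'
    and at least one class ending in 'Font' (redFont, greenFont)."""
    classes = tag.get("class", [])
    return (
        tag.name == "td"
        and "bold" in classes
        and any(c.endswith("Font") for c in classes)
    )


changes = [td.get_text() for td in table.find_all(is_change_cell)]
print(changes)
```

Because the check works on the individual class names, it does not depend on the order in which the classes appear in the markup.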
I would actually use pandas to parse this well-formatted table into a DataFrame, which is quite convenient to work with. pandas provides extensive documentation to help you understand how to work with a DataFrame:
import pandas as pd
data = """
<table class="genTbl closedTbl historicalTbl" id="curr_table">
<thead>
<tr>
<th class="first left noWrap">Date</th>
<th class="noWrap">Price</th>
<th class="noWrap">Open</th>
<th class="noWrap">High</th>
<th class="noWrap">Low</th>
<th class="noWrap">Vol.</th> <th class="noWrap">Change %</th>
</tr>
</thead>
<tbody>
<tr>
<td class="first left bold noWrap">Jul 15, 2016</td>
<td class="redFont">98.78</td>
<td>99.02</td>
<td>99.30</td>
<td>98.51</td>
<td>30.14M</td> <td class="bold redFont">-0.01%</td>
</tr>
<tr>
<td class="first left bold noWrap">Jul 14, 2016</td>
<td class="greenFont">98.79</td>
<td>97.39</td>
<td>98.99</td>
<td>97.32</td>
<td>38.92M</td> <td class="bold greenFont">1.98%</td>
</tr>
</tbody>
</table>
"""
from io import StringIO  # recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(data))[0]
print(df)
print("----")
print(df['Change %'].tolist())
Prints:
           Date  Price   Open   High    Low    Vol. Change %
0  Jul 15, 2016  98.78  99.02  99.30  98.51  30.14M   -0.01%
1  Jul 14, 2016  98.79  97.39  98.99  97.32  38.92M    1.98%
----
['-0.01%', '1.98%']
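Whichever way you extract them, the values come back as strings. If you need numbers, a small conversion step along these lines might help. This is just a sketch: percent_to_float is a hypothetical helper name, and the list is copied from the output above.

```python
# Example values copied from the printed output above.
changes = ['-0.01%', '1.98%']


def percent_to_float(value):
    """Hypothetical helper: turn a string such as '-0.01%' into the float -0.01.

    The result stays in percent units; divide by 100 if you need a fraction.
    """
    return float(value.strip().rstrip('%'))


numeric_changes = [percent_to_float(v) for v in changes]
print(numeric_changes)
```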
Upvotes: 2