Reputation: 471
I want to scrape data from a table, taking the whole row (<tr>) if there is a <td BGCOLOR="#D42A2A"> in the row.
The html looks like this (there are more than 2 rows):
<tr bgcolor="#f4f4f4">
<td height="25" nowrap="NOWRAP"> ITEM_1 </td>
<td height="25" nowrap="NOWRAP"> 07:58:46 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 07:58:00 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 01:25:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP"> ITEM_2 </td>
<td height="25" nowrap="NOWRAP"> 07:58:46 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 191 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 07:58:01 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 01:25:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP"> ITEM_3 </td>
<td height="25" nowrap="NOWRAP"> 07:59:02 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 36 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 36 </td>
<td height="25" nowrap="NOWRAP"> 07:58:01 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#d42a2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 03:10:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
I tried an existing answer, but it returns every row in the table rather than only the rows that contain the required attribute.
My code so far looks like this:
data = []
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
Then I scrape the site again to find the bgcolor attribute, add it to a list, append the list to a data frame, and drop any row that doesn't have the correct bgcolor.
This is all quite inefficient.
How can I scrape the HTML so that a row is taken from the table only if one of its td elements has the required bgcolor attribute?
EDIT: Once the solutions below are applied to the entire HTML, the script returns empty lists (that is my fault for not including more of the HTML). The HTML below is a more complete version with more tags included.
<html><head><title></title><style type="text/css">
BODY {
font-family: Tahoma, Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: 11px;
background-color: #FFFFFF
;}TABLE {
font-family: Tahoma, Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: 11px;
background-color: #FFFFFF;}
DIV.boldText {
font-size: 11px;font-weight: bold;
}
</style>
<meta http-equiv="REFRESH" content="10">
</head><body>
<form name="DataViewChooser">
<hr width="95%" align="CENTER" color="#55aa2a">
<table width="95%" align="CENTER">
<tbody><tr><td width="40" height="65" title="(c) ITEMS"><img
src="/icons/geneos_logo.png"></td>
<td width="25" align="LEFT">
<img title="Refresh" style="cursor: hand;" onclick="reloadPage()"
src="/icons/refresh.png"></td>
<td width="25" title="Show Fail and Warning Only" align="LEFT"><img
style="cursor: hand;" onclick="userContractView()" src="/icons/minimise.png"></td>
<td width="25" align="LEFT"><img title="Home" style="cursor: hand;" onclick="goHome()" src="/icons/up.png"></td>
<td align="RIGHT" nowrap="NOWRAP"><img src="/icons/hostgreen.gif">
<div class="boldText"> DASHBOARD-CV_AMER_Dashboard</div> [GROUP]
</td>
</tr></tbody></table><hr width="95%" align="CENTER" color="#55aa2a"></form>
<br><table width="95%" align="CENTER"><tbody><tr><td><table>
<tbody><tr><th height="20" align="LEFT" nowrap="NOWRAP"> AMER
</th>
<td nowrap="NOWRAP" bgcolor="#55aa2a"> </td></tr>
</tbody></table></td></tr></tbody></table>
<br><table width="99%" align="CENTER">
<tbody><tr bgcolor="#c0c0c0">
<th height="20" align="LEFT" nowrap="NOWRAP"> RowName </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> Gateway_updatetime
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> Gateway_state </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> OrdersCleared </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> Ticketsread </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> OrdersNotCleared
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> TicketsNotCleared
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> LastReadingtime
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> LastClearingtime
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> ClearingInProgress
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> YestVolumes </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> Starttime </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> Stoptime </th>
</tr><tr bgcolor="#f4f4f4">
<td height="25" nowrap="NOWRAP"> ITEM_4 </td>
<td height="25" nowrap="NOWRAP"> 07:58:46 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 07:58:00 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#d42a2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 01:25:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP"> ITEM_5 </td>
<td height="25" nowrap="NOWRAP"> 07:58:46 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 191 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 07:58:01 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 01:25:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
</tbody></table><script language="JavaScript" src="/cookie.js"></script>
</body></html>
It is also worth noting that I am using urllib.request to open the URL and then parsing the result with BeautifulSoup.
Upvotes: 1
Views: 317
Reputation: 71451
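You can use any:
from bs4 import BeautifulSoup as soup
# content is assumed to hold the page's HTML as a string
d = soup(content, 'html.parser')
# keep a row only if at least one of its cells has the wanted bgcolor
results = [i for i in d.find_all('tr') if any(c.attrs.get('bgcolor') == "#d42a2a" for c in i.find_all('td'))]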
Output:
[<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP"> ITEM_3 </td>
<td height="25" nowrap="NOWRAP"> 07:59:02 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 36 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 36 </td>
<td height="25" nowrap="NOWRAP"> 07:58:01 </td>
<td bgcolor="#d42a2a" height="25" nowrap="NOWRAP"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 03:10:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>]
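If the goal is the cell text rather than the Tag objects, the same any() check can feed the list-building loop from the question directly. The following is only a sketch under some assumptions: the sample HTML is a trimmed-down, made-up version of the question's table, and rows_with_bgcolor is a hypothetical helper name. It lowercases the attribute before comparing, so both "#D42A2A" and "#d42a2a" match, and it searches the whole parsed document rather than one particular table.

```python
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical sample modelled on the question's table.
SAMPLE_HTML = """
<table><tbody>
<tr bgcolor="#f4f4f4">
<td> ITEM_1 </td><td> Connected </td><td bgcolor="#55aa2a"> --:--:-- </td>
</tr>
<tr bgcolor="#ffffff">
<td> ITEM_3 </td><td> Connected </td><td bgcolor="#D42A2A"> --:--:-- </td>
</tr>
</tbody></table>
"""


def rows_with_bgcolor(html, colour="#d42a2a"):
    """Return the cell texts of every row that has a td with the given bgcolor."""
    soup = BeautifulSoup(html, "html.parser")
    wanted = colour.lower()
    data = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Compare case-insensitively; cells without bgcolor yield "".
        if any(cell.get("bgcolor", "").lower() == wanted for cell in cells):
            data.append([cell.get_text(strip=True) for cell in cells])
    return data


matching_rows = rows_with_bgcolor(SAMPLE_HTML)
print(matching_rows)  # [['ITEM_3', 'Connected', '--:--:--']]
```

This filters and extracts in a single pass, so the page does not need to be scraped a second time.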
Upvotes: 1
Reputation: 19154
Find all td elements that have bgcolor="#d42a2a", then select the .parent of each one:
cells = table_body.find_all('td', bgcolor="#d42a2a")
for cell in cells:
    print(cell.parent)
    # <tr>...<td bgcolor="#d42a2a">...</tr>
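One thing to watch with the parent approach: if a row ever contains two highlighted cells, it is reported twice. A minimal sketch of one way to avoid that follows, assuming a made-up two-row sample; it uses find_parent('tr') instead of .parent and keeps each row only once by comparing object identity.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first row has two highlighted cells.
SAMPLE_HTML = """
<table><tbody>
<tr><td> ITEM_A </td><td bgcolor="#d42a2a"> x </td><td bgcolor="#d42a2a"> y </td></tr>
<tr><td> ITEM_B </td><td bgcolor="#55aa2a"> z </td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

unique_rows = []
for cell in soup.find_all("td", bgcolor="#d42a2a"):
    row = cell.find_parent("tr")
    # Use identity (is) so that two distinct rows with equal markup are both kept.
    if not any(row is seen for seen in unique_rows):
        unique_rows.append(row)

row_names = [row.td.get_text(strip=True) for row in unique_rows]
print(row_names)  # ['ITEM_A']
```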
Upvotes: 0
Reputation: 473873
You can apply a searching function that checks that the tag name is tr and that the row contains a td element with bgcolor="#d42a2a". Attribute values are matched case-sensitively, so the value is written in lowercase here, as it appears in the HTML:
def rows_with_desired_bgcolor(elm):
    return elm.name == 'tr' and elm.find('td', bgcolor="#d42a2a")

table_body.find_all(rows_with_desired_bgcolor)
You could, of course, do the same check in a list comprehension directly:
[tr for tr in table_body('tr') if tr.find('td', bgcolor="#d42a2a")]
where table_body('tr') is a shortcut to table_body.find_all('tr').
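The same row filter can probably also be written as a CSS selector. This is a sketch rather than a definitive recipe: it assumes a Beautiful Soup version that ships with the Soup Sieve selector engine (4.7 or later), relies on that engine's support for :has() and the case-insensitive attribute flag i, and uses a made-up sample in which the colour appears in both upper and lower case.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with the colour written in both cases.
SAMPLE_HTML = """
<table><tbody>
<tr><td> ITEM_1 </td><td bgcolor="#55aa2a"> --:--:-- </td></tr>
<tr><td> ITEM_3 </td><td bgcolor="#d42a2a"> --:--:-- </td></tr>
<tr><td> ITEM_4 </td><td bgcolor="#D42A2A"> --:--:-- </td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# tr elements that have a direct td child whose bgcolor matches, ignoring case.
selected = soup.select('tr:has(> td[bgcolor="#d42a2a" i])')
names = [tr.td.get_text(strip=True) for tr in selected]
print(names)  # ['ITEM_3', 'ITEM_4']
```

Because the selector runs against the whole soup, it does not depend on which table table_body happens to point at.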
Upvotes: 1