swagless_monk

Reputation: 471

Scrape <tr> tag if <td> tag has attribute

I want to scrape data from a table, taking a whole row <tr> only if the row contains a <td bgcolor="#d42a2a"> cell.

The html looks like this (there are more than 2 rows):

<tr bgcolor="#f4f4f4">
<td height="25" nowrap="NOWRAP">&nbsp;ITEM_1&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:46&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;Connected&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:00&nbsp;</td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a">&nbsp;--:--:--&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;01:25:00 &nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp; 22:00:00&nbsp;</td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP">&nbsp;ITEM_2&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:46&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;Connected&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;191&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:01&nbsp;</td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a">&nbsp;--:--:--&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;01:25:00 &nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp; 22:00:00&nbsp;</td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP">&nbsp;ITEM_3&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:59:02&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;Connected&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;36&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;36&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:01&nbsp;</td>
<td height="25" nowrap="NOWRAP" bgcolor="#d42a2a">&nbsp;--:--:--&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;03:10:00 &nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp; 22:00:00&nbsp;</td>
</tr>

I tried an existing answer, but it returns all rows in the table instead of only the rows that contain the required attribute.

So my code so far looks like:

data = []

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

Then I scrape the site a second time to find the bgcolor attribute, add it to a list, append the list to a data frame, and drop every row that doesn't have the correct bgcolor.

This is all quite inefficient.

How can I scrape the HTML so that a row is taken from the table only if one of its td elements has the required bgcolor in its attrs?

EDIT: When the solutions below are applied to the entire HTML, the script returns empty lists (my fault for not including more of the HTML). Below is a more complete version of the HTML, with more of the surrounding tags included.

<html><head><title></title><style type="text/css">
BODY {
font-family: Tahoma, Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: 11px;
background-color: #FFFFFF
;}TABLE {
font-family: Tahoma, Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: 11px;
background-color: #FFFFFF;}
DIV.boldText {
font-size: 11px;font-weight: bold;
}
</style>
<meta http-equiv="REFRESH" content="10">
</head><body>
<form name="DataViewChooser">
<hr width="95%" align="CENTER" color="#55aa2a">
<table width="95%" align="CENTER">
<tbody><tr><td width="40" height="65" title="(c) ITEMS"><img 
src="/icons/geneos_logo.png"></td>
<td width="25" align="LEFT">
<img title="Refresh" style="cursor: hand;" onclick="reloadPage()" 
src="/icons/refresh.png"></td>
<td width="25" title="Show Fail and Warning Only" align="LEFT"><img 
style="cursor: hand;" onclick="userContractView()" src="/icons/minimise.png"></td>
<td width="25" align="LEFT"><img title="Home" style="cursor: hand;" onclick="goHome()" src="/icons/up.png"></td>
<td align="RIGHT" nowrap="NOWRAP"><img src="/icons/hostgreen.gif">
<div class="boldText">&nbsp;DASHBOARD-CV_AMER_Dashboard</div>&nbsp; [GROUP]
</td>
</tr></tbody></table><hr width="95%" align="CENTER" color="#55aa2a"></form>
<br><table width="95%" align="CENTER"><tbody><tr><td><table>
<tbody><tr><th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;AMER&nbsp; 
</th>
<td nowrap="NOWRAP" bgcolor="#55aa2a">&nbsp;&nbsp;</td></tr>
</tbody></table></td></tr></tbody></table>
<br><table width="99%" align="CENTER">
<tbody><tr bgcolor="#c0c0c0">
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;RowName&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;Gateway_updatetime&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;Gateway_state&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;OrdersCleared&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;Ticketsread&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;OrdersNotCleared&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;TicketsNotCleared&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;LastReadingtime&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;LastClearingtime&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;ClearingInProgress&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;YestVolumes&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;Starttime&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;Stoptime&nbsp;</th>
</tr><tr bgcolor="#f4f4f4">
<td height="25" nowrap="NOWRAP">&nbsp;ITEM_4&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:46&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;Connected&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:00&nbsp;</td>
<td height="25" nowrap="NOWRAP" bgcolor="#d42a2a">&nbsp;--:--:--&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;01:25:00 &nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp; 22:00:00&nbsp;</td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP">&nbsp;ITEM_5&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:46&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;Connected&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;191&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:01&nbsp;</td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a">&nbsp;--:--:--&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;01:25:00 &nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp; 22:00:00&nbsp;</td>
</tr>
</tbody></table><script language="JavaScript" src="/cookie.js"></script>
</body></html>

It is also worth noting that I am using urllib.request to open the URL and then parsing the response with BeautifulSoup.

Upvotes: 1

Views: 317

Answers (3)

Ajax1234

Reputation: 71451

You can use any() to keep only the rows in which at least one td has the target bgcolor:

from bs4 import BeautifulSoup as soup
d = soup(content, 'html.parser')
results = [i for i in d.find_all('tr') if any(c.attrs.get('bgcolor') == "#d42a2a" for c in i.find_all('td'))]

Output:

[<tr bgcolor="#ffffff">
  <td height="25" nowrap="NOWRAP"> ITEM_3 </td>
  <td height="25" nowrap="NOWRAP"> 07:59:02 </td>
  <td height="25" nowrap="NOWRAP"> Connected </td>
  <td height="25" nowrap="NOWRAP"> 0 </td>
  <td height="25" nowrap="NOWRAP"> 36 </td>
  <td height="25" nowrap="NOWRAP"> 0 </td>
  <td height="25" nowrap="NOWRAP"> 36 </td>
  <td height="25" nowrap="NOWRAP"> 07:58:01 </td>
  <td bgcolor="#d42a2a" height="25" nowrap="NOWRAP"> --:--:-- </td>
  <td height="25" nowrap="NOWRAP"> 0 </td>
  <td height="25" nowrap="NOWRAP"> 0 </td>
  <td height="25" nowrap="NOWRAP"> 03:10:00  </td>
  <td height="25" nowrap="NOWRAP">  22:00:00 </td>
 </tr>]
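
If the goal is the list of cell texts from the question rather than the tags themselves, the same any() check can be folded into the original loop. This is only a minimal sketch: the two-row html string and the flagged_rows helper are made up for illustration and are not part of the page in the question.

```python
from bs4 import BeautifulSoup

# Trimmed-down, made-up sample in the shape of the question's table.
html = """
<table>
<tr bgcolor="#f4f4f4">
<td>&nbsp;ITEM_1&nbsp;</td><td bgcolor="#55aa2a">&nbsp;--:--:--&nbsp;</td><td>&nbsp;0&nbsp;</td>
</tr>
<tr bgcolor="#ffffff">
<td>&nbsp;ITEM_3&nbsp;</td><td bgcolor="#d42a2a">&nbsp;--:--:--&nbsp;</td><td>&nbsp;36&nbsp;</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def flagged_rows(soup, colour="#d42a2a"):
    """Return the stripped cell texts of every row with a <td> of the given bgcolor."""
    data = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Only the <td> attributes are checked; the bgcolor on the <tr> is ignored.
        if any(cell.get("bgcolor") == colour for cell in cells):
            data.append([cell.get_text(strip=True) for cell in cells])
    return data


data = flagged_rows(soup)
print(data)
```

This filters and extracts in a single pass, so the page does not have to be scraped twice.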

Upvotes: 1

ewwink

Reputation: 19154

Find every td that has bgcolor="#d42a2a", then take its .parent to get the enclosing row:

cells = table_body.find_all('td', bgcolor="#d42a2a")
for cell in cells:
    print(cell.parent) 
    # <tr>...<td bgcolor="#d42a2a">...</tr>
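
One thing to watch with this approach: a row that has two matching cells is reached twice. The sketch below, on a made-up html sample, uses find_parent('tr') and an identity check to keep each row once; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up sample: the first row has two cells with the target colour.
html = """
<table>
<tr><td>ITEM_A</td><td bgcolor="#d42a2a">x</td><td bgcolor="#d42a2a">y</td></tr>
<tr><td>ITEM_B</td><td bgcolor="#55aa2a">z</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for cell in soup.find_all("td", bgcolor="#d42a2a"):
    row = cell.find_parent("tr")
    # Compare by identity so that two distinct rows with identical markup
    # are both kept, while the same row is never added twice.
    if row is not None and not any(row is seen for seen in rows):
        rows.append(row)

names = [row.td.get_text(strip=True) for row in rows]
print(names)
```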

Upvotes: 0

alecxe

Reputation: 473873

You can pass a search function to find_all() that checks that the tag is a tr and that the row contains a td element with bgcolor="#d42a2a". Attribute values are matched case-sensitively, so the value has to be written in lowercase, as it appears in the HTML:

def rows_with_desired_bgcolor(elm):
    # Keep only <tr> tags that contain at least one matching <td>
    return elm.name == 'tr' and elm.find('td', bgcolor="#d42a2a") is not None

table_body.find_all(rows_with_desired_bgcolor)

You could, of course, do the same check directly in a list comprehension:

[tr for tr in table_body('tr') if tr.find('td', bgcolor="#d42a2a")]

where table_body('tr') is a shortcut for table_body.find_all('tr').
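
If the case of the colour value cannot be relied on, a compiled regular expression can be passed as the attribute filter instead of a plain string. A minimal sketch, assuming a made-up html sample in which the same colour appears in both cases:

```python
import re

from bs4 import BeautifulSoup

# Made-up sample: the target colour appears once in lowercase and once in uppercase.
html = """
<table>
<tr><td>ITEM_1</td><td bgcolor="#55aa2a">ok</td></tr>
<tr><td>ITEM_3</td><td bgcolor="#d42a2a">late</td></tr>
<tr><td>ITEM_4</td><td BGCOLOR="#D42A2A">late</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# html.parser lowercases attribute names but leaves attribute values untouched,
# so only the value needs a case-insensitive comparison.
red = re.compile(r"^#d42a2a$", re.IGNORECASE)

rows = [tr for tr in soup.find_all("tr") if tr.find("td", bgcolor=red)]
names = [tr.td.get_text(strip=True) for tr in rows]
print(names)
```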

Upvotes: 1
