Reputation: 2043
I have a pretty large HTML file. I need to scrape it and extract certain information.
soup.findAll('table',{"summary" : "This table displays snapshot information"})
[<table border="1" summary="This table displays snapshot information" width="500">
<tbody><tr><th class="awrnobg" scope="col"></th><th class="awrbg" scope="col">Snap Id</th><th class="awrbg" scope="col">Snap Time</th><th class="awrbg" scope="col">Sessions</th><th class="awrbg" scope="col">Cursors/Session</th></tr>
<tr><td class="awrnc" scope="row">Begin Snap:</td><td align="right" class="awrnc">98810</td><td align="center" class="awrnc">29-Jun-15 08:00:02</td><td align="right" class="awrnc">700</td><td align="right" class="awrnc"> 129.6</td></tr>
<tr><td class="awrc" scope="row">End Snap:</td><td align="right" class="awrc">98864</td><td align="center" class="awrc">29-Jun-15 17:00:23</td><td align="right" class="awrc">703</td><td align="right" class="awrc"> 129.1</td></tr>
<tr><td class="awrnc" scope="row">Elapsed:</td><td class="awrnc"> </td><td align="center" class="awrnc"> 540.35 (mins)</td><td class="awrnc"> </td><td class="awrnc"> </td></tr>
<tr><td class="awrc" scope="row">DB Time:</td><td class="awrc"> </td><td align="center" class="awrc"> 2,963.17 (mins)</td><td class="awrc"> </td><td class="awrc"> </td></tr>
</tbody></table>]
Using Beautiful Soup I managed to get a list, but I need to extract the date 29-Jun-15 08:00:02. Any ideas? I can manipulate the list items individually, but that looks ugly to me.
Upvotes: 1
Views: 971
Reputation: 14169
Just search for the td using its class. It should return a list and you can proceed from there.
from bs4 import BeautifulSoup as bsoup
html = """<table border="1" summary="This table displays snapshot information" width="500">
<tbody><tr><th class="awrnobg" scope="col"></th><th class="awrbg" scope="col">Snap Id</th><th class="awrbg" scope="col">Snap Time</th><th class="awrbg" scope="col">Sessions</th><th class="awrbg" scope="col">Cursors/Session</th></tr>
<tr><td class="awrnc" scope="row">Begin Snap:</td><td align="right" class="awrnc">98810</td><td align="center" class="awrnc">29-Jun-15 08:00:02</td><td align="right" class="awrnc">700</td><td align="right" class="awrnc"> 129.6</td></tr>
<tr><td class="awrc" scope="row">End Snap:</td><td align="right" class="awrc">98864</td><td align="center" class="awrc">29-Jun-15 17:00:23</td><td align="right" class="awrc">703</td><td align="right" class="awrc"> 129.1</td></tr>
<tr><td class="awrnc" scope="row">Elapsed:</td><td class="awrnc"> </td><td align="center" class="awrnc"> 540.35 (mins)</td><td class="awrnc"> </td><td class="awrnc"> </td></tr>
<tr><td class="awrc" scope="row">DB Time:</td><td class="awrc"> </td><td align="center" class="awrc"> 2,963.17 (mins)</td><td class="awrc"> </td><td class="awrc"> </td></tr>
</tbody></table>"""
soup = bsoup(html, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
print(soup.find_all('td', class_='awrnc')[2].get_text())
# 29-Jun-15 08:00:02
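Indexing with [2] depends on how many td cells happen to share the awrnc class. As a rough alternative, here is a minimal sketch that looks the cell up by its row label and column header instead. The helper name cell_by_labels is hypothetical (not part of BeautifulSoup), the HTML is a trimmed copy of the table above, and the sketch assumes the first row holds the th headers and the first td of each row holds the label.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the snapshot table from the question.
html = """<table summary="This table displays snapshot information">
<tr><th></th><th>Snap Id</th><th>Snap Time</th></tr>
<tr><td>Begin Snap:</td><td>98810</td><td>29-Jun-15 08:00:02</td></tr>
<tr><td>End Snap:</td><td>98864</td><td>29-Jun-15 17:00:23</td></tr>
</table>"""


def cell_by_labels(table, row_label, column_label):
    """Hypothetical helper: return the text of the cell at (row_label, column_label).

    Assumes the first <tr> contains the <th> headers and that the first <td>
    of every other row is the row label. Returns None if the row is missing.
    """
    rows = table.find_all("tr")
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    col = headers.index(column_label)  # raises ValueError for an unknown column
    for row in rows[1:]:
        cells = row.find_all("td")
        if cells and cells[0].get_text(strip=True) == row_label:
            return cells[col].get_text(strip=True)
    return None


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"summary": "This table displays snapshot information"})
begin_snap_time = cell_by_labels(table, "Begin Snap:", "Snap Time")
print(begin_snap_time)
```

This is a bit more code than plain indexing, but it should keep working if the report adds rows or changes the CSS classes.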
EDIT:
Taking into account your original code, which returns a list of tables, just use normal list indexing/slicing to get the table you want. See the following example. I changed the above HTML to have three tables with the same summary attribute. My code will return all three, so I'll select the first one. Then I'll look for all the td elements that match my defined class and choose the third one using [2]. Finally, I'll use get_text() to get the value inside the target td element.
from bs4 import BeautifulSoup as bsoup
html = """<html><body><table border="1" summary="This table displays snapshot information" width="500">
<tbody><tr><th class="awrnobg" scope="col"></th><th class="awrbg" scope="col">Snap Id</th><th class="awrbg" scope="col">Snap Time</th><th class="awrbg" scope="col">Sessions</th><th class="awrbg" scope="col">Cursors/Session</th></tr>
<tr><td class="awrnc" scope="row">Begin Snap:</td><td align="right" class="awrnc">98810</td><td align="center" class="awrnc">29-Jun-15 08:00:02</td><td align="right" class="awrnc">700</td><td align="right" class="awrnc"> 129.6</td></tr>
<tr><td class="awrc" scope="row">End Snap:</td><td align="right" class="awrc">98864</td><td align="center" class="awrc">29-Jun-15 17:00:23</td><td align="right" class="awrc">703</td><td align="right" class="awrc"> 129.1</td></tr>
<tr><td class="awrnc" scope="row">Elapsed:</td><td class="awrnc"> </td><td align="center" class="awrnc"> 540.35 (mins)</td><td class="awrnc"> </td><td class="awrnc"> </td></tr>
<tr><td class="awrc" scope="row">DB Time:</td><td class="awrc"> </td><td align="center" class="awrc"> 2,963.17 (mins)</td><td class="awrc"> </td><td class="awrc"> </td></tr>
</tbody></table><table summary="This table displays snapshot information"></table><table summary="This table displays snapshot information"></table></body></html>"""
soup = bsoup(html, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
list_of_tables = soup.find_all("table", {"summary":"This table displays snapshot information"}) # This will return 3 tables based on the above HTML.
target_table = list_of_tables[0] # Target the first one.
list_of_tds = target_table.find_all('td', class_='awrnc')
target_td = list_of_tds[2]
target_value = target_td.get_text()
print(target_value)
# 29-Jun-15 08:00:02
TL;DR: Just use [0] on your list. It seems to be the only table you find anyway. After that, you can search inside it again, because each item in the list is a BeautifulSoup Tag object that supports find_all() just like the soup itself.
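If you need the value as an actual date rather than a string, here is a small sketch using only the standard library. The format string is my reading of the values shown in the table (day, abbreviated month, two-digit year, 24-hour time), and %b depends on the locale, so this assumes English month names.

```python
from datetime import datetime

# Format inferred from values like "29-Jun-15 08:00:02" (an assumption, not
# something the report documents). %b is locale-dependent.
SNAP_TIME_FORMAT = "%d-%b-%y %H:%M:%S"

begin = datetime.strptime("29-Jun-15 08:00:02", SNAP_TIME_FORMAT)
end = datetime.strptime("29-Jun-15 17:00:23", SNAP_TIME_FORMAT)

elapsed = end - begin  # a datetime.timedelta
print(begin.isoformat())
print(elapsed.total_seconds() / 60)  # elapsed minutes between the two snapshots
```

The difference between the two snapshot times comes out to the same 540.35 minutes that the Elapsed row reports, which is a handy sanity check on the format string.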
Upvotes: 5