Norhther

Reputation: 500

Scraping children of a tag with bs4

I'm trying to get some info from this page

from bs4 import BeautifulSoup
import requests

url = 'https://www.emsc-csem.org/Earthquake/'

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

info = soup.findAll("tbody")
for elem in info:
    print(elem)

From here, I would like to take the info relating to each earthquake. So, for each row, I would like to iterate over some well-known indexes (td elements in this case) and grab their content. Is there any way to do that inside this for loop, without making another soup? I imagine something like elem.td[3] (I know it's wrong, but just to give the idea).
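A minimal sketch of what that indexing could look like (assuming each data row has at least four cells; the index 3 is just an illustration): each element BeautifulSoup returns is itself a Tag, so you can call find_all on it again without building a new soup.

from bs4 import BeautifulSoup
import requests

url = 'https://www.emsc-csem.org/Earthquake/'

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

for row in soup.find_all("tr"):
    cells = row.find_all("td")          # only the <td> children of this row
    if len(cells) > 3:                  # skip header/separator rows with fewer cells
        print(cells[3].get_text(strip=True))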

Edit: I tried something like this:

from bs4 import BeautifulSoup
import requests

url = 'https://www.emsc-csem.org/Earthquake/'

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

info = soup.find_all("tr")
for tr in info:
    date = tr.find("td", {"class": "tabev6"})
    if date is not None:
        print(date.text)

And now I'm unsure why date is None in some cases, even after looking at the HTML; each tr is supposed to have a td with class="tabev6".
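Presumably the rows where date comes back as None are not data rows at all (tables like this typically also contain header and separator rows). A minimal sketch that sidesteps the None checks, assuming the td.tabev6 cells only appear in data rows, is to select those cells directly and walk up to their parent row:

from bs4 import BeautifulSoup
import requests

url = 'https://www.emsc-csem.org/Earthquake/'
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Select the date cells directly, then climb to the row containing each one;
# rows without a td.tabev6 never show up, so there is nothing to filter out.
for date_cell in soup.select("td.tabev6"):
    row = date_cell.find_parent("tr")
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    print(cells)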

Upvotes: 0

Views: 70

Answers (2)

QHarr

Reputation: 84465

You could write a custom function to pull the info out of the tds in the table. In the example below, I extract the hyperlink, combine the lat and lon values so each ends up in a single column, and split dates and times into separate columns to allow for additional sorting.

from bs4 import BeautifulSoup as bs
import requests, re
import pandas as pd

def get_processed_row(sample_row: object) -> list:
    row = []
    lat = lon = ''

    for num, td in enumerate(sample_row.select('td')):
        if num == 3:  # cell containing the link, date/time and "ago" text
            link_node = td.select_one('a')
            link = 'https://www.emsc-csem.org/' + link_node['href']
            date, time = re.sub('(\\xa0)+', ' ', link_node.text).split(' ')
            ago = td.select_one('.ago').text
            row.extend([link, date, time, ago])
        elif num in [4, 5]:  # latitude is split across two cells; join them
            lat += td.text
        elif num in [6, 7]:  # longitude is split across two cells; join them
            lon += td.text
        elif num == 12:  # last-update cell: split into date and time
            update_date, update_time = td.text.split(' ')
            row.extend([update_date, update_time])
        else:
            row.extend([td.text.strip()])
    row.extend([lat, lon])
    return row

soup = bs(requests.get('https://www.emsc-csem.org/Earthquake/').content, 'lxml')
rows = [get_processed_row(row) for row in soup.select('#tbody tr')]
df = pd.DataFrame(rows)

df.columns = ['Num_Comments', 'Num_Pictures', 'MacroseismicIntensity', 'Link', 'DateUTC', 'TimeUTC', 'Ago',
              'Depth_km', 'MagType', 'Mag', 'Region', 'LastUpdateDate', 'LastUpdateTime', 'LatDegrees','LonDegrees']

df = df[['Num_Comments', 'Num_Pictures', 'MacroseismicIntensity', 'Link', 'DateUTC', 'TimeUTC', 'Ago',
         'LatDegrees','LonDegrees', 'Depth_km', 'MagType', 'Mag', 'Region', 'LastUpdateDate', 'LastUpdateTime']]

print(df)

You could also do the whole thing with pandas, but again there is some column tidying to do, as some data is split across different columns and there are repeated headers to deal with. The following uses read_html and indexing to grab the right table, then handles those cases, as well as pulling Date, Time and Ago out into separate columns. There is some column re-ordering at the end, cleaning of NaNs, and then, as you requested, a conversion to a list.

import pandas as pd

t = pd.read_html('https://www.emsc-csem.org/Earthquake/')[3]
t = t.iloc[:-2, :] # remove the last two rows
t.columns = [i for i in range(len(t.columns))] # rename columns as some headers are repeated
t['LatDegrees'] = t[4] + t[5] #join cols 
t['LonDegrees'] = t[6] + t[7] #join cols
t['DateUtc'] = t[3].apply(lambda x: str(x)[:10]) #subset for new col
t['TimeUtc'] = t[3].apply(lambda x: str(x)[12:21])
t['Ago'] = t[3].apply(lambda x: str(x)[22:])
t.drop(t.iloc[:,3:8], axis=1, inplace=True) #get rid of columns used to generate new cols

t.columns = ['Num_Comments', 'Num_Pictures', 'MacroseismicIntensity', 'Depth_km', 'MagType', 'Mag', 'Region', 
             'LastUpdateDateTime', 'LatDegrees','LonDegrees', 'DateUTC', 'TimeUTC', 'Ago'] #add headers

t = t[['Num_Comments', 'Num_Pictures', 'MacroseismicIntensity', 'LatDegrees','LonDegrees',  'Depth_km', 'MagType', 'Mag', 'Region', 
       'DateUTC', 'TimeUTC', 'Ago', 'LastUpdateDateTime']] #re-order columns

t['MagType'] = t['MagType'].str.upper() # ensure case consistency
t = t.dropna(how='all').fillna('') # remove nans
rows = t.values.tolist() # convert to requested list (list of lists)
print(rows)
#for row in rows:
    #print(' '.join(row))

In both cases I dislike the reliance on the column positions in the source table remaining constant, but that has to be an assumption for the above to work. Header names may be just as likely to change as column order, and there is the repeated-header issue on top of that.
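If the positional assumption worries you, one alternative is to select cells by class rather than by index. A rough sketch only: apart from tabev6, which the question confirms is the date cell, the class names below are placeholders you would need to verify against the page source.

from bs4 import BeautifulSoup as bs
import requests

soup = bs(requests.get('https://www.emsc-csem.org/Earthquake/').content, 'lxml')

for row in soup.select('#tbody tr'):
    date_cell = row.select_one('td.tabev6')       # class confirmed in the question
    region_cell = row.select_one('td.tb_region')  # placeholder class name - verify in the page source
    if date_cell and region_cell:
        print(date_cell.get_text(strip=True), region_cell.get_text(strip=True))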

Upvotes: 2

TheoNeUpKID

Reputation: 893

Just add .text and filter out the data you need. Note: you could also do this by adding another loop and iterating over elem.text.split(); a sketch of that follows the snippet below.

from bs4 import BeautifulSoup
import requests

url = 'https://www.emsc-csem.org/Earthquake/'

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

info = soup.findAll("tbody")
for elem in info:
    print(elem.text)
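For completeness, a minimal sketch of that loop-and-split idea, splitting the flattened text on newlines (what you keep from each line is up to you):

from bs4 import BeautifulSoup
import requests

url = 'https://www.emsc-csem.org/Earthquake/'

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

for elem in soup.findAll("tbody"):
    # Split the flattened text into lines, drop the empty ones,
    # then filter for whatever fields you are interested in.
    for line in elem.text.split("\n"):
        line = line.strip()
        if line:
            print(line)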

If you would like to see it in action, visit my Colab notebook:

https://colab.research.google.com/drive/1uMosEG-owTdzixo-vaDRQW_aYfseLG8Q?usp=sharing


Upvotes: 1
