Reputation: 500
I'm trying to get some info from this page:

from bs4 import BeautifulSoup
import requests

url = 'https://www.emsc-csem.org/Earthquake/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
info = soup.findAll("tbody")
for elem in info:
    print(elem)
From here, I would like to be able to take the info relative to each earthquake. So, for each row, I would like to iterate over some well-known indexes (the td cells, in this case) and grab the content. Is there any way to do it inside this for loop, without making another soup? I think the way I would like this to be done is something like elem.td[3] (I know that's wrong, but just to keep the idea in mind).
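To make the intent concrete, here is a rough sketch of what I'm after (the index 3 is just an example, and the length guard is there because I expect some rows to be shorter):

for elem in info:                          # each tbody from the snippet above
    for row in elem.find_all("tr"):        # each table row inside it
        cells = row.find_all("td")         # this row's td elements, indexable
        if len(cells) > 3:                 # guard: some rows have fewer cells
            print(cells[3].get_text(strip=True))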
Edit: I tried something like this:
from bs4 import BeautifulSoup
import requests

url = 'https://www.emsc-csem.org/Earthquake/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
info = soup.find_all("tr")
for tr in info:
    date = tr.find("td", {"class": "tabev6"})
    if date is not None:
        print(date.text)
And now I'm unsure why date is None in some cases, after looking at the html. Each tr is supposed to have a td with class="tabev6".
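To see what those rows actually are, I can run a quick debug loop that prints only the rows where the lookup fails (presumably header or separator rows with no matching td):

for tr in soup.find_all("tr"):
    if tr.find("td", {"class": "tabev6"}) is None:
        # show the row's class and a preview of its text for inspection
        print(tr.get("class"), tr.get_text(strip=True)[:60])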
Upvotes: 0
Views: 70
Reputation: 84465
You could write a custom function to pull out the info from the tds in the table. In the example below, I extract the hyperlink, ensure the lat and lon values each end up in a single column, and split the times and dates into separate columns to allow for additional sorting.
from bs4 import BeautifulSoup as bs
import requests, re
import pandas as pd

def get_processed_row(sample_row: object) -> list:
    row = []
    lat = lon = ''
    for num, td in enumerate(sample_row.select('td')):
        if num == 3:
            # this cell holds the link to the event page plus date/time and "ago"
            link_node = td.select_one('a')
            link = 'https://www.emsc-csem.org/' + link_node['href']
            date, time = re.sub('(\\xa0)+', ' ', link_node.text).split(' ')
            ago = td.select_one('.ago').text
            row.extend([link, date, time, ago])
        elif num in [4, 5]:
            lat += td.text  # lat arrives split across two cells; join them
        elif num in [6, 7]:
            lon += td.text  # same for lon
        elif num == 12:
            update_date, update_time = td.text.split(' ')
            row.extend([update_date, update_time])
        else:
            row.extend([td.text.strip()])
    row.extend([lat, lon])
    return row

soup = bs(requests.get('https://www.emsc-csem.org/Earthquake/').content, 'lxml')
rows = [get_processed_row(row) for row in soup.select('#tbody tr')]
df = pd.DataFrame(rows)
df.columns = ['Num_Comments', 'Num_Pictures', 'MacroseismicIntensity', 'Link', 'DateUTC', 'TimeUTC', 'Ago',
              'Depth_km', 'MagType', 'Mag', 'Region', 'LastUpdateDate', 'LastUpdateTime', 'LatDegrees', 'LonDegrees']
df = df[['Num_Comments', 'Num_Pictures', 'MacroseismicIntensity', 'Link', 'DateUTC', 'TimeUTC', 'Ago',
         'LatDegrees', 'LonDegrees', 'Depth_km', 'MagType', 'Mag', 'Region', 'LastUpdateDate', 'LastUpdateTime']]
print(df)
You could also do the whole thing with pandas, but again there is some column tidying up to do, as some data is split across different columns and there are some repeat headers to deal with. The following uses read_html and indexing to grab the right table, then handles the aforementioned cases, as well as pulling out Date, Time and Ago into separate columns. There is some column re-ordering at the end, cleaning of NaNs, and then, at your request, a conversion to a list.
import pandas as pd

t = pd.read_html('https://www.emsc-csem.org/Earthquake/')[3]
t = t.iloc[:-2, :]  # remove the last two rows
t.columns = [i for i in range(len(t.columns))]  # rename columns as some headers are repeated
t['LatDegrees'] = t[4] + t[5]  # join split lat cols
t['LonDegrees'] = t[6] + t[7]  # join split lon cols
t['DateUtc'] = t[3].apply(lambda x: str(x)[:10])  # subset for new col
t['TimeUtc'] = t[3].apply(lambda x: str(x)[12:21])
t['Ago'] = t[3].apply(lambda x: str(x)[22:])
t.drop(t.columns[3:8], axis=1, inplace=True)  # get rid of columns used to generate new cols
t.columns = ['Num_Comments', 'Num_Pictures', 'MacroseismicIntensity', 'Depth_km', 'MagType', 'Mag', 'Region',
             'LastUpdateDateTime', 'LatDegrees', 'LonDegrees', 'DateUTC', 'TimeUTC', 'Ago']  # add headers
t = t[['Num_Comments', 'Num_Pictures', 'MacroseismicIntensity', 'LatDegrees', 'LonDegrees', 'Depth_km', 'MagType', 'Mag', 'Region',
       'DateUTC', 'TimeUTC', 'Ago', 'LastUpdateDateTime']]  # re-order columns
t['MagType'] = t['MagType'].str.upper()  # ensure case consistency
t = t.dropna(how='all').fillna('')  # remove nans
rows = t.values.tolist()  # convert to requested list (list of lists)
print(rows)
# for row in rows:
#     print(' '.join(map(str, row)))
In both cases I dislike the reliance on the positioning in the source table remaining constant, but that has to be an assumption for the above. Header names may be as likely to change as column order, plus there is the repeated-header issue.
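If you want to fail fast when that assumption breaks rather than silently mis-parse, one cheap guard is to check the cell count per row before processing. A sketch (the expected count of 13 tds is inferred from the 0..12 indexes handled in get_processed_row above):

EXPECTED_TDS = 13  # inferred from the indexes 0..12 used above

checked = []
for tr in soup.select('#tbody tr'):
    tds = tr.select('td')
    if len(tds) == EXPECTED_TDS:
        checked.append(get_processed_row(tr))
    else:
        # layout probably changed, or a non-data row slipped in
        print(f'Skipping row with {len(tds)} cells')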
Upvotes: 2
Reputation: 893
Just add .text and filter out the data you need.
Note: you could also do this by adding another loop and iterating over elem.text.split("") (see the sketch after the snippet below).
from bs4 import BeautifulSoup
import requests

url = 'https://www.emsc-csem.org/Earthquake/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
info = soup.findAll("tbody")
for elem in info:
    print(elem.text)
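A minimal sketch of that extra loop, assuming newline as the separator (str.split does not accept an empty separator, so the exact split argument is up to you):

for elem in info:
    for line in elem.text.split("\n"):  # assumption: one fragment per line
        line = line.strip()
        if line:                        # skip blank lines
            print(line)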
If you would like to see it in action, visit my Colab notebook:
https://colab.research.google.com/drive/1uMosEG-owTdzixo-vaDRQW_aYfseLG8Q?usp=sharing
Upvotes: 1