joegodzila

Reputation: 33

How to scrape text from HTML into a dataframe, removing extra header and footer information?

I would like to extract focal mechanism information from the GCMT catalog (https://www.globalcmt.org/). Eventually I plan to automate this in Python so I can pull earthquake information from the GCMT search results for plotting and analysis, without going through the webpage by hand.

Here's the code I have so far with an example URL:

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.globalcmt.org/cgi-bin/globalcmt-cgi-bin/CMT5/form?itype=ymd&yr=1976&mo=1&day=1&oyr=1976&omo=1&oday=1&jyr=1976&jday=1&ojyr=1976&ojday=1&otype=nd&nday=365&lmw=0&umw=10&lms=0&ums=10&lmb=0&umb=10&llat=-90&ulat=90&llon=-180&ulon=180&lhd=0&uhd=1000&lts=-9999&uts=9999&lpe1=0&upe1=90&lpe2=0&upe2=90&list=6"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html5lib")

text = soup.body.get_text(separator= '\n', strip=True)
print(text)

Global CMT Catalog
Search criteria:
Start date: 1976/1/1   End date: 1976/12/30
-90 <=lat<= 90          -180 <=lon<= 180 
0 <=depth<= 1000         -9999 <=time shift<= 9999
0 <=mb<= 10        0<=Ms<= 10           0<=Mw<= 10
0 <=tension plunge<= 90         0 <=null plunge<= 90
Results
Output in
GMT
psmeca (GMT v>3.3) format
Columns: lon lat depth mrr mtt mpp mrt mrp mtp iexp name
-176.96 -29.25 48 7.68 0.09 -7.77 1.39 4.52 -3.26 26 X Y 010176A        
-75.14 -13.42 85 -1.78 -0.59 2.37 -1.28 1.97 -2.90 24 X Y 010576A        
159.50 51.45 15 1.10 -0.30 -0.80 1.05 1.24 -0.56 25 X Y 010676A
...

I'm still new to Python/web scraping, but I would like to extract the data starting from the header line (Columns: lon lat depth mrr mtt mpp mrt mrp mtp iexp name) and stop before the footer (End of events found with given criteria.) and everything after it.

The output would contain column information: lon lat depth mrr mtt mpp mrt mrp mtp iexp name

Then the data (e.g.): -176.96 -29.25 48 7.68 0.09 -7.77 1.39 4.52 -3.26 26 X Y 010176A

Upvotes: 2

Views: 732

Answers (3)

DougR

Reputation: 3479

You can use the split method to divide the text into sections. I also cleaned up the table by removing the "X Y" markers.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import io

URL = "https://www.globalcmt.org/cgi-bin/globalcmt-cgi-bin/CMT5/form?itype=ymd&yr=1976&mo=1&day=1&oyr=1976&omo=1&oday=1&jyr=1976&jday=1&ojyr=1976&ojday=1&otype=nd&nday=365&lmw=0&umw=10&lms=0&ums=10&lmb=0&umb=10&llat=-90&ulat=90&llon=-180&ulon=180&lhd=0&uhd=1000&lts=-9999&uts=9999&lpe1=0&upe1=90&lpe2=0&upe2=90&list=6"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
text = soup.body.get_text(separator= '\n', strip=True)

split_top, split_main = text.split('\nColumns: ')
split_main, split_end = split_main.split('\nEnd of events found with given criteria.')

# remove the "X Y "
split_main = split_main.replace(' X Y ',' ')

df = pd.read_csv(io.StringIO(split_main), sep=r'\s+')
df
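If you also want physical moment-tensor values rather than the raw table entries, note that in GMT's psmeca convention the tensor components are mantissas to be multiplied by 10**iexp (in dyn·cm). A minimal sketch of that scaling on one parsed row, assuming that convention applies here:

```python
import pandas as pd

# One row in the shape produced by read_csv above (values from the example output)
df = pd.DataFrame({"mrr": [7.68], "mtt": [0.09], "iexp": [26]})

# Scale the mantissas by 10**iexp to get components in dyn*cm
# (assumes psmeca's mantissa/exponent convention)
scaled = df[["mrr", "mtt"]].mul(10.0 ** df["iexp"], axis=0)
print(scaled)
```

The `iexp` column itself can then be dropped once the components are scaled.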


Upvotes: 0

HedgeHog

Reputation: 25048

You could create a list of dicts from header and values:

header = soup.select_one('pre:nth-of-type(2)').find_previous(text=True).split()[1:]
header[10:10] = ['x','y']

data = []
for l in soup.select_one('pre:nth-of-type(2)').text.splitlines():
    d = l.split()
    #d[10:13] = [' '.join([str(x) for x in d[10:13]])]
    # del d[10:12]
    data.append(dict(zip(header,d)))

The tricky part, in my opinion, is that you have to handle the last elements in each row to avoid a mismatch with the headers.

Assuming "X Y ..." belong together:

d[10:13] = [' '.join([str(x) for x in d[10:13]])]

or if they are not needed simply delete them:

del d[10:12]

or adjust the headers instead:

header[10:10] = ['x','y']

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.globalcmt.org/cgi-bin/globalcmt-cgi-bin/CMT5/form?itype=ymd&yr=1976&mo=1&day=1&oyr=1976&omo=1&oday=1&jyr=1976&jday=1&ojyr=1976&ojday=1&otype=nd&nday=365&lmw=0&umw=10&lms=0&ums=10&lmb=0&umb=10&llat=-90&ulat=90&llon=-180&ulon=180&lhd=0&uhd=1000&lts=-9999&uts=9999&lpe1=0&upe1=90&lpe2=0&upe2=90&list=6"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html5lib")

data = []

header = soup.select_one('pre:nth-of-type(2)').find_previous(text=True).split()[1:]
header[10:10] = ['x','y']

for l in soup.select_one('pre:nth-of-type(2)').text.splitlines():
    d = l.split()
    #d[10:13] = [' '.join([str(x) for x in d[10:13]])]
    # del d[10:12]
    data.append(dict(zip(header,d)))

pd.DataFrame(data)

Output

lon lat depth mrr mtt mpp mrt mrp mtp iexp x y name
0 -176.96 -29.25 48 7.68 0.09 -7.77 1.39 4.52 -3.26 26 X Y 010176A
1 -75.14 -13.42 85 -1.78 -0.59 2.37 -1.28 1.97 -2.9 24 X Y 010576A
2 159.5 51.45 15 1.1 -0.3 -0.8 1.05 1.24 -0.56 25 X Y 010676A
3 167.81 -15.97 174 -1.7 2.29 -0.59 -2.33 -1.23 2.01 25 X Y 010976A
4 -16.29 66.33 15 -0.51 -2.86 3.37 0.05 -0.78 -0.86 25 X Y 011376A
5 -177.04 -29.69 47 4.78 -0.49 -4.3 0.83 3.62 -1.32 27 X Y 011476A
6 -176.75 -28.72 18 2.56 0.18 -2.74 3.58 6.77 -1.23 27 X Y 011476B
7 -176.62 -28.61 15 2.34 0.24 -2.58 0.62 3.71 -0.68 25 X Y 011476C
8 -176.63 -30.25 15 1.44 0.06 -1.5 0.3 1.18 -0.46 25 X Y 011576A

...
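Since the dicts are built from `split()`, every column in this DataFrame holds strings (object dtype). A small sketch of casting everything except the marker and name columns to numbers, using a hypothetical single-row `data` list in the same shape:

```python
import pandas as pd

# One dict in the shape produced by zip(header, d) above
data = [
    {"lon": "-176.96", "lat": "-29.25", "depth": "48", "iexp": "26",
     "x": "X", "y": "Y", "name": "010176A"},
]
df = pd.DataFrame(data)

# Cast every column except the non-numeric ones to proper number dtypes
num_cols = [c for c in df.columns if c not in ("x", "y", "name")]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
print(df.dtypes)
```

After this, comparisons and plotting work on real floats/ints instead of strings.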

Upvotes: 2

PlainRavioli

Reputation: 1211

Assuming the data part will always start with "Columns:"

import re

match = re.finditer(r"Columns:", text)  # find the data part with a regex
index_data = next(match).start(0)  # position of the first match

lines = [x.strip().split(" ") for x in text[index_data + len("Columns:"):].split("\n")]  # split on line breaks, then split each line on spaces
columns = lines[0] #first line is the column names
values = lines[1:] # other are the values
print(columns)
print(values)

output:

['lon', 'lat', 'depth', 'mrr', 'mtt', 'mpp', 'mrt', 'mrp', 'mtp', 'iexp', 'name']
[['-176.96', '-29.25', '48', '7.68', '0.09', '-7.77', '1.39', '4.52', '-3.26', '26', 'X', 'Y', '010176A'], ['-75.14', '-13.42', '85', '-1.78', '-0.59', '2.37', '-1.28', '1.97', '-2.90', '24', 'X', 'Y', '010576A'], ['159.50', '51.45', '15', '1.10', '-0.30', '-0.80', '1.05', '1.24', '-0.56', '25', 'X', 'Y', '010676A']]
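Each data row above has 13 tokens while there are only 11 column names, so `columns` and `values` don't line up directly. One way to reconcile them, assuming the 'X' and 'Y' tokens can be discarded, is to drop them before building a DataFrame:

```python
import pandas as pd

# Headers and one row as produced by the split above
columns = ['lon', 'lat', 'depth', 'mrr', 'mtt', 'mpp',
           'mrt', 'mrp', 'mtp', 'iexp', 'name']
values = [['-176.96', '-29.25', '48', '7.68', '0.09', '-7.77',
           '1.39', '4.52', '-3.26', '26', 'X', 'Y', '010176A']]

# Drop tokens 10 and 11 ('X', 'Y') so each row matches the 11 headers
rows = [v[:10] + v[12:] for v in values]
df = pd.DataFrame(rows, columns=columns)
print(df)
```

If the markers should be kept instead, extend `columns` with two extra names (e.g. `'x', 'y'`) rather than trimming the rows.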

Hope this helps!

Upvotes: 0
