joegodzila

Reputation: 33

How to scrape text from HTML into a dataframe, removing extra header and footer information?

I would like to extract focal mechanism information from the GCMT catalog (https://www.globalcmt.org/). Eventually I plan to automate this in Python so I can pull earthquake information from the GCMT search results for plotting and analysis, without going through the webpage by hand.

Here's the code I have so far with an example URL:

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.globalcmt.org/cgi-bin/globalcmt-cgi-bin/CMT5/form?itype=ymd&yr=1976&mo=1&day=1&oyr=1976&omo=1&oday=1&jyr=1976&jday=1&ojyr=1976&ojday=1&otype=nd&nday=365&lmw=0&umw=10&lms=0&ums=10&lmb=0&umb=10&llat=-90&ulat=90&llon=-180&ulon=180&lhd=0&uhd=1000&lts=-9999&uts=9999&lpe1=0&upe1=90&lpe2=0&upe2=90&list=6"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html5lib")

text = soup.body.get_text(separator= '\n', strip=True)
print(text)

Global CMT Catalog
Search criteria:
Start date: 1976/1/1   End date: 1976/12/30
-90 <=lat<= 90          -180 <=lon<= 180 
0 <=depth<= 1000         -9999 <=time shift<= 9999
0 <=mb<= 10        0<=Ms<= 10           0<=Mw<= 10
0 <=tension plunge<= 90         0 <=null plunge<= 90
Results
Output in
GMT
psmeca (GMT v>3.3) format
Columns: lon lat depth mrr mtt mpp mrt mrp mtp iexp name
-176.96 -29.25 48 7.68 0.09 -7.77 1.39 4.52 -3.26 26 X Y 010176A        
-75.14 -13.42 85 -1.78 -0.59 2.37 -1.28 1.97 -2.90 24 X Y 010576A        
159.50 51.45 15 1.10 -0.30 -0.80 1.05 1.24 -0.56 25 X Y 010676A
...

I'm still new to Python/web scraping, but I would like to extract the data starting from the header line (Columns: lon lat depth mrr mtt mpp mrt mrp mtp iexp name) and stop before the footer (End of events found with given criteria.) and everything after it.

The output would contain column information: lon lat depth mrr mtt mpp mrt mrp mtp iexp name

Then the data (e.g.): -176.96 -29.25 48 7.68 0.09 -7.77 1.39 4.52 -3.26 26 X Y 010176A

Upvotes: 2

Views: 732

Answers (3)

DougR

Reputation: 3479

You can use the split method to divide the text into sections. I also cleaned up the table by removing the "X Y" markers.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import io

URL = "https://www.globalcmt.org/cgi-bin/globalcmt-cgi-bin/CMT5/form?itype=ymd&yr=1976&mo=1&day=1&oyr=1976&omo=1&oday=1&jyr=1976&jday=1&ojyr=1976&ojday=1&otype=nd&nday=365&lmw=0&umw=10&lms=0&ums=10&lmb=0&umb=10&llat=-90&ulat=90&llon=-180&ulon=180&lhd=0&uhd=1000&lts=-9999&uts=9999&lpe1=0&upe1=90&lpe2=0&upe2=90&list=6"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
text = soup.body.get_text(separator= '\n', strip=True)

split_top, split_main = text.split('\nColumns: ')
split_main, split_end = split_main.split('\nEnd of events found with given criteria.')

# remove the "X Y "
split_main = split_main.replace(' X Y ',' ')

df = pd.read_csv(io.StringIO(split_main), sep=r'\s+')
df
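If you also want physical moment-tensor values rather than the raw table entries, note that in GMT's psmeca convention the tensor components are mantissas to be multiplied by 10**iexp (in dyn·cm). A minimal sketch of that scaling on one parsed row, assuming that convention applies here:

```python
import pandas as pd

# One row in the shape produced by read_csv above (values from the example output)
df = pd.DataFrame({"mrr": [7.68], "mtt": [0.09], "iexp": [26]})

# Scale the mantissas by 10**iexp to get components in dyn*cm
# (assumes psmeca's mantissa/exponent convention)
scaled = df[["mrr", "mtt"]].mul(10.0 ** df["iexp"], axis=0)
print(scaled)
```

The `iexp` column itself can then be dropped once the components are scaled.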


Upvotes: 0

HedgeHog

Reputation: 25048

You could create a list of dicts from header and values:

header = soup.select_one('pre:nth-of-type(2)').find_previous(text=True).split()[1:]
header[10:10] = ['x','y']

data = []
for l in soup.select_one('pre:nth-of-type(2)').text.splitlines():
    d = l.split()
    #d[10:13] = [' '.join([str(x) for x in d[10:13]])]
    # del d[10:12]
    data.append(dict(zip(header,d)))

The tricky part, in my opinion, is that you have to handle the last elements in each row to avoid a mismatch with the headers.

Assuming "X Y ..." belong together:

d[10:13] = [' '.join([str(x) for x in d[10:13]])]

or if they are not needed simply delete them:

del d[10:12]

or adjust the headers instead:

header[10:10] = ['x','y']

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.globalcmt.org/cgi-bin/globalcmt-cgi-bin/CMT5/form?itype=ymd&yr=1976&mo=1&day=1&oyr=1976&omo=1&oday=1&jyr=1976&jday=1&ojyr=1976&ojday=1&otype=nd&nday=365&lmw=0&umw=10&lms=0&ums=10&lmb=0&umb=10&llat=-90&ulat=90&llon=-180&ulon=180&lhd=0&uhd=1000&lts=-9999&uts=9999&lpe1=0&upe1=90&lpe2=0&upe2=90&list=6"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html5lib")

data = []

header = soup.select_one('pre:nth-of-type(2)').find_previous(text=True).split()[1:]
header[10:10] = ['x','y']

for l in soup.select_one('pre:nth-of-type(2)').text.splitlines():
    d = l.split()
    #d[10:13] = [' '.join([str(x) for x in d[10:13]])]
    # del d[10:12]
    data.append(dict(zip(header,d)))

pd.DataFrame(data)

Output

lon lat depth mrr mtt mpp mrt mrp mtp iexp x y name
0 -176.96 -29.25 48 7.68 0.09 -7.77 1.39 4.52 -3.26 26 X Y 010176A
1 -75.14 -13.42 85 -1.78 -0.59 2.37 -1.28 1.97 -2.9 24 X Y 010576A
2 159.5 51.45 15 1.1 -0.3 -0.8 1.05 1.24 -0.56 25 X Y 010676A
3 167.81 -15.97 174 -1.7 2.29 -0.59 -2.33 -1.23 2.01 25 X Y 010976A
4 -16.29 66.33 15 -0.51 -2.86 3.37 0.05 -0.78 -0.86 25 X Y 011376A
5 -177.04 -29.69 47 4.78 -0.49 -4.3 0.83 3.62 -1.32 27 X Y 011476A
6 -176.75 -28.72 18 2.56 0.18 -2.74 3.58 6.77 -1.23 27 X Y 011476B
7 -176.62 -28.61 15 2.34 0.24 -2.58 0.62 3.71 -0.68 25 X Y 011476C
8 -176.63 -30.25 15 1.44 0.06 -1.5 0.3 1.18 -0.46 25 X Y 011576A

...
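Since the dicts are built from `split()`, every column in this DataFrame holds strings (object dtype). A small sketch of casting everything except the marker and name columns to numbers, using a hypothetical single-row `data` list in the same shape:

```python
import pandas as pd

# One dict in the shape produced by zip(header, d) above
data = [
    {"lon": "-176.96", "lat": "-29.25", "depth": "48", "iexp": "26",
     "x": "X", "y": "Y", "name": "010176A"},
]
df = pd.DataFrame(data)

# Cast every column except the non-numeric ones to proper number dtypes
num_cols = [c for c in df.columns if c not in ("x", "y", "name")]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
print(df.dtypes)
```

After this, comparisons and plotting work on real floats/ints instead of strings.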

Upvotes: 2

PlainRavioli

Reputation: 1211

Assuming the data part will always start with "Columns:"

import re

match = re.finditer(r"Columns:", text)  # find the data part with a regex
index_data = next(match).start(0)  # position of the first match

lines = [x.strip().split(" ") for x in text[index_data + len("Columns:"):].split("\n")]  # split on line breaks, then split each line on spaces
columns = lines[0] #first line is the column names
values = lines[1:] # other are the values
print(columns)
print(values)

output:

['lon', 'lat', 'depth', 'mrr', 'mtt', 'mpp', 'mrt', 'mrp', 'mtp', 'iexp', 'name']
[['-176.96', '-29.25', '48', '7.68', '0.09', '-7.77', '1.39', '4.52', '-3.26', '26', 'X', 'Y', '010176A'], ['-75.14', '-13.42', '85', '-1.78', '-0.59', '2.37', '-1.28', '1.97', '-2.90', '24', 'X', 'Y', '010576A'], ['159.50', '51.45', '15', '1.10', '-0.30', '-0.80', '1.05', '1.24', '-0.56', '25', 'X', 'Y', '010676A']]
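Each data row above has 13 tokens while there are only 11 column names, so `columns` and `values` don't line up directly. One way to reconcile them, assuming the 'X' and 'Y' tokens can be discarded, is to drop them before building a DataFrame:

```python
import pandas as pd

# Headers and one row as produced by the split above
columns = ['lon', 'lat', 'depth', 'mrr', 'mtt', 'mpp',
           'mrt', 'mrp', 'mtp', 'iexp', 'name']
values = [['-176.96', '-29.25', '48', '7.68', '0.09', '-7.77',
           '1.39', '4.52', '-3.26', '26', 'X', 'Y', '010176A']]

# Drop tokens 10 and 11 ('X', 'Y') so each row matches the 11 headers
rows = [v[:10] + v[12:] for v in values]
df = pd.DataFrame(rows, columns=columns)
print(df)
```

If the markers should be kept instead, extend `columns` with two extra names (e.g. `'x', 'y'`) rather than trimming the rows.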

Hope this helps!

Upvotes: 0
