Reputation: 81
I'm somewhat new to Python and I've been given a task that requires scraping data from a table. I don't know much HTML either. I've never done this before and have spent a couple of days looking at various ways to scrape tables. Unfortunately, all of the examples seem to use simpler webpage layouts than the one I'm dealing with. I've tried quite a few methods, but none of them let me select the table data I need.
How would one scrape the table at the bottom of the following webpage under the "Daily Water Level" tab?
url = https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/Hydrograph.aspx?gw_logid=HARN0052657
I've tried the methods in the following links, among others not shown here:
Scrape table with BeautifulSoup
Web scraping with BeautifulSoup
Some of the script I've tried:
from bs4 import BeautifulSoup
import requests
url = "https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/Hydrograph.aspx?gw_logid=HARN0052657"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("table")  # also tried soup.find_all("table", {"class": "xxxx"})
I've also tried using pandas, but I can't figure out how to select the table I need instead of the first table on the webpage that has the basic well information:
import pandas as pd
df_list = pd.read_html(url)
df_list
Unfortunately, the data I need doesn't even show up when I run this script, and the table I'm trying to select doesn't have a class attribute I could use to distinguish it from the basic well-information table. I've inspected the webpage but can't seem to find a way to reach the correct table.
As for the final result, I would need to export it as a CSV file or a pandas DataFrame so that I can graph it against modeled groundwater data for comparison. Any suggestions would be greatly appreciated!
Upvotes: 1
Views: 374
Reputation: 986
Try the approach below using the requests library: it is simple, straightforward, reliable, and fast, and requires very little code. I fetched the API URL from the website itself by inspecting the Network section of Chrome's developer tools.
What exactly the script below does:
First, it takes the API URL and makes a GET request with dynamic parameters (in CAPS); you can change the well number and the start and end dates to get the desired result.
After getting the response, the script parses the JSON data using json.loads.
It then iterates over the list of daily water level readings and builds a list of records so they can be written to a CSV file, e.g. GW Log Id, GW Site Id, Land Surface Elevation, Record Date, etc.
Finally, it writes the headers and data to the CSV file. (Important: make sure to set the file path in the file_path variable.)
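Concretely, the parse-and-iterate steps above look like this in isolation. This minimal sketch uses a trimmed, hand-written sample of the API response instead of a live request; the field names follow the full script below, but the numeric values are invented for illustration:

```python
import json

# Hand-written sample shaped like the API response; the real payload
# contains one entry per daily reading.
sample_payload = '''
{
  "feature_count": 2,
  "feature_list": [
    {"gw_logid": "HARN0052657", "record_date": "2019-03-01",
     "waterlevel_ft_below_land_surface": 181.23},
    {"gw_logid": "HARN0052657", "record_date": "2019-03-02",
     "waterlevel_ft_below_land_surface": 181.45}
  ]
}
'''

json_result = json.loads(sample_payload)  # parse the JSON text into a dict
print(json_result['feature_count'])       # number of daily readings
for item in json_result['feature_list']:  # iterate over the readings
    print(item['record_date'], item['waterlevel_ft_below_land_surface'])
```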
import csv
import json

import requests
from urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def scrap_daily_water_level():
    file_path = ''  # Input file path here
    file_name = 'daily_water_level_data.csv'  # File name

    # CSV headers
    csv_headers = ['Line #', 'GW Log Id', 'GW Site Id', 'Land Surface Elevation',
                   'Record Date', 'Restrict to OWRD only', 'Reviewed Status',
                   'Reviewed Status Description', 'Water level ft above mean sea level',
                   'Water level ft below land surface']
    list_of_water_readings = []

    # Dynamic params
    WELL_NO = 'HARN0052657'
    START_DATE = '1/1/1905'
    END_DATE = '12/30/2050'

    # API URL
    URL = ('https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/' + WELL_NO +
           '/gw_recorder_water_level_daily_mean_public/?start_date=' + START_DATE +
           '&end_date=' + END_DATE + '&reviewed_status=&restrict_to_owrd_only=n')

    response = requests.get(URL, verify=False)  # GET API call
    json_result = json.loads(response.text)  # Parse the JSON response
    print('Daily water level data count:', json_result['feature_count'])  # No. of records

    extracted_data = json_result['feature_list']  # Extracted data as a list of dicts
    for idx, item in enumerate(extracted_data):  # Iterate over the extracted data
        list_of_water_readings.append({  # Build a list of rows keyed by the CSV headers
            'Line #': idx + 1,
            'GW Log Id': item['gw_logid'],
            'GW Site Id': item['gw_site_id'],
            'Land Surface Elevation': item['land_surface_elevation'],
            'Record Date': item['record_date'],
            'Restrict to OWRD only': item['restrict_to_owrd_only'],
            'Reviewed Status': item['reviewed_status'],
            'Reviewed Status Description': item['reviewed_status_description'],
            'Water level ft above mean sea level': item['waterlevel_ft_above_mean_sea_level'],
            'Water level ft below land surface': item['waterlevel_ft_below_land_surface']
        })

    # Create the CSV file and write the data into it.
    # 'w' with newline='' avoids duplicating the header on re-runs
    # and avoids blank lines between rows on Windows.
    with open(file_path + file_name, 'w', newline='') as daily_water_level_data_CSV:
        csvwriter = csv.DictWriter(daily_water_level_data_CSV, fieldnames=csv_headers)
        print('Writing CSV header now...')
        csvwriter.writeheader()  # Write headers to the CSV file
        print('Writing data rows now...')
        for item in list_of_water_readings:  # Save each row to the CSV file
            csvwriter.writerow(item)

scrap_daily_water_level()
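Since the question asks for a CSV file or a pandas DataFrame, the parsed feature_list (a list of dicts) can also be loaded straight into pandas instead of going through the csv module. This sketch uses a two-row stand-in for the list; the field names follow the API response used above, but the values are invented for illustration:

```python
import pandas as pd

# Two-row stand-in for json_result['feature_list']; field names follow the
# API response above, the numeric values are invented for illustration.
feature_list = [
    {"gw_logid": "HARN0052657", "record_date": "2019-03-01",
     "waterlevel_ft_below_land_surface": 181.23},
    {"gw_logid": "HARN0052657", "record_date": "2019-03-02",
     "waterlevel_ft_below_land_surface": 181.45},
]

df = pd.DataFrame(feature_list)
df["record_date"] = pd.to_datetime(df["record_date"])  # real dates help plotting
df.to_csv("daily_water_level_data.csv", index=False)   # same CSV, one line
print(df.shape)  # -> (2, 3)
```

From there, something like df.plot(x="record_date", y="waterlevel_ft_below_land_surface") gives a quick hydrograph to compare against the modeled groundwater data.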
Upvotes: 1