jjkennedy

Reputation: 81

Scraping table data with BeautifulSoup or Pandas

I'm somewhat new to Python and I've been given a task that requires scraping data from a table. I don't know much HTML either. I've never done this before and have spent a couple of days looking at various ways to scrape tables. Unfortunately, all of the examples use what appears to be a simpler webpage layout than the one I'm dealing with. I've tried quite a few methods, but none of them let me select the table data I need.

How would one scrape the table at the bottom of the following webpage under the "Daily Water Level" tab?

url = https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/Hydrograph.aspx?gw_logid=HARN0052657

I've tried using the methods in the following links and others not shown here:

Beautiful Soup Scraping table

Scrape table with BeautifulSoup

Web scraping with BeautifulSoup

Some of the code I've tried:

from bs4 import BeautifulSoup
import requests

url = "https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/Hydrograph.aspx?gw_logid=HARN0052657"

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("table")  # also tried soup.find_all("table", {"class": "xxxx"})

I've also tried using pandas, but I can't figure out how to select the table I need instead of the first table on the page, which only has the basic well information:

import pandas as pd
df_list = pd.read_html(url)
df_list

Unfortunately, the data I need doesn't even show up when I run this script, and the table I'm trying to select doesn't have a class attribute that I could use to select only that table and not the basic well information table. I've inspected the webpage but can't seem to find a way to get to the correct table.
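
For what it's worth, I understand pd.read_html also accepts a match argument to pick out a table by the text it contains. A minimal sketch of what I mean (the "Daily Water Level" match string is just my guess at the tab's label):

import pandas as pd

# Keep only tables whose text matches the pattern;
# raises ValueError if no table matches.
df_list = pd.read_html(url, match="Daily Water Level")

But since the daily data apparently isn't in the HTML that requests fetches, I assume this wouldn't help here either.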

As far as the final result would look, I would need to export it as a csv or as a pandas data frame so that I can then graph it with modeled groundwater data for comparison purposes. Any suggestions would be greatly appreciated!

Upvotes: 1

Views: 374

Answers (1)

Vin

Reputation: 986

Try the approach below using Python's requests library: it's simple, straightforward, reliable, and fast, and less code is required. I fetched the API URL from the website itself by inspecting the Network section of the Google Chrome developer tools.

What exactly the script below does:

  1. First it takes the API URL and makes a GET request with the dynamic parameters (in CAPS); you can change the values of the well number and the start and end dates to get the desired result.

  2. After getting the data, the script parses the JSON response using json.loads.

  3. It iterates over the list of daily water level data and builds a list of all the data points so they can be written to a CSV file, e.g. GW Log Id, GW Site Id, Land Surface Elevation, Record Date, etc.

  4. Finally it writes all the headers and data to the CSV file. (Important: make sure to set the file path in the file_path variable.)

     import csv
     import json
     import requests
     from urllib3.exceptions import InsecureRequestWarning

     requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

     def scrap_daily_water_level():

         file_path = ''  # Input file path here
         file_name = 'daily_water_level_data.csv'  # File name

         # CSV headers
         csv_headers = ['Line #', 'GW Log Id', 'GW Site Id', 'Land Surface Elevation', 'Record Date', 'Restrict to OWRD only', 'Reviewed Status', 'Reviewed Status Description', 'Water level ft above mean sea level', 'Water level ft below land surface']
         list_of_water_readings = []

         # Dynamic params
         WELL_NO = 'HARN0052657'
         START_DATE = '1/1/1905'
         END_DATE = '12/30/2050'

         # API URL
         URL = 'https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/' + WELL_NO + '/gw_recorder_water_level_daily_mean_public/?start_date=' + START_DATE + '&end_date=' + END_DATE + '&reviewed_status=&restrict_to_owrd_only=n'

         response = requests.get(URL, verify=False)  # GET API call
         json_result = json.loads(response.text)  # Parse the JSON response

         print('Daily water level data count ', json_result['feature_count'])  # Print the number of records
         extracted_data = json_result['feature_list']  # Extracted data as a list of dicts

         for idx, item in enumerate(extracted_data):  # Iterate over the list of extracted data
             list_of_water_readings.append({  # Build a list of rows keyed by the CSV headers
                 'Line #': idx + 1,
                 'GW Log Id': item['gw_logid'],
                 'GW Site Id': item['gw_site_id'],
                 'Land Surface Elevation': item['land_surface_elevation'],
                 'Record Date': item['record_date'],
                 'Restrict to OWRD only': item['restrict_to_owrd_only'],
                 'Reviewed Status': item['reviewed_status'],
                 'Reviewed Status Description': item['reviewed_status_description'],
                 'Water level ft above mean sea level': item['waterlevel_ft_above_mean_sea_level'],
                 'Water level ft below land surface': item['waterlevel_ft_below_land_surface']
             })

         # Create the CSV and write the data into it
         with open(file_path + file_name, 'a+') as daily_water_level_data_CSV:  # Open file in a+ mode
             csvwriter = csv.DictWriter(daily_water_level_data_CSV, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
             print('Writing CSV header now...')
             csvwriter.writeheader()  # Write headers to the CSV file
             for item in list_of_water_readings:  # Iterate over the collected rows and save them to the CSV file
                 print('Writing data rows now..')
                 print(item)
                 csvwriter.writerow(item)

     scrap_daily_water_level()

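Once the CSV is written, you can load it straight back into pandas for the graphing/comparison you mentioned. A minimal sketch, assuming the default file name above (the API's date format is my assumption, so adjust the parsing if needed):

     import pandas as pd
     import matplotlib.pyplot as plt

     # Load the CSV produced by the script above
     df = pd.read_csv('daily_water_level_data.csv', parse_dates=['Record Date'])
     df = df.sort_values('Record Date')

     # Plot the daily water level for comparison against modeled data
     df.plot(x='Record Date', y='Water level ft below land surface')
     plt.show()
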
Upvotes: 1
