jjkennedy

Reputation: 81

Scraping table data with BeautifulSoup or Pandas

I'm somewhat new to Python and I've been given a task that requires scraping data from a table. I don't know much HTML either. I've never done this before and have spent a couple of days looking at various ways to scrape tables. Unfortunately, all of the examples use what appears to be a simpler webpage layout than the one I'm dealing with. I've tried quite a few methods, but none of them let me select the table data I need.

How would one scrape the table at the bottom of the following webpage under the "Daily Water Level" tab?

url = https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/Hydrograph.aspx?gw_logid=HARN0052657

I've tried using the methods in the following links and others not shown here:

Beautiful Soup Scraping table

Scrape table with BeautifulSoup

Web scraping with BeautifulSoup

Some of the code I've tried:

from bs4 import BeautifulSoup
import requests

url = "https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/Hydrograph.aspx?gw_logid=HARN0052657"

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("table")  # also tried soup.find_all("table", {"class": "xxxx"})

I've also tried using pandas, but I can't figure out how to select the table I need instead of the first table on the page, which only has the basic well information:

import pandas as pd
df_list = pd.read_html(url)
df_list

Unfortunately, the data I need doesn't even show up when I run this script, and the table I'm trying to select doesn't have a class attribute that I could use to select only that table and not the basic well information table. I've inspected the webpage but can't seem to find a way to get to the correct table.
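
For what it's worth, I understand pd.read_html also accepts a match argument to pick out a table by the text it contains. A minimal sketch of what I mean (the "Daily Water Level" match string is just my guess at the tab's label):

import pandas as pd

# Keep only tables whose text matches the pattern;
# raises ValueError if no table matches.
df_list = pd.read_html(url, match="Daily Water Level")

But since the daily data apparently isn't in the HTML that requests fetches, I assume this wouldn't help here either.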

As far as the final result would look, I would need to export it as a csv or as a pandas data frame so that I can then graph it with modeled groundwater data for comparison purposes. Any suggestions would be greatly appreciated!

Upvotes: 1

Views: 374

Answers (1)

Vin

Reputation: 986

Try the approach below using Python's requests library: it's simple, straightforward, reliable, and fast, and less code is required. I fetched the API URL from the website itself by inspecting the Network section of the Google Chrome developer tools.

What exactly the script below does:

  1. First it takes the API URL and makes a GET request with the dynamic parameters (in CAPS); you can change the values of the well number and the start and end dates to get the desired result.

  2. After getting the data, the script parses the JSON response using json.loads.

  3. It iterates over the list of daily water level data and builds a list of all the data points so they can be written to a CSV file, e.g. GW Log Id, GW Site Id, Land Surface Elevation, Record Date, etc.

  4. Finally it writes all the headers and data to the CSV file. (Important: make sure to set the file path in the file_path variable.)

     import csv
     import json
     import requests
     from urllib3.exceptions import InsecureRequestWarning

     requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

     def scrap_daily_water_level():

         file_path = ''  # Input file path here
         file_name = 'daily_water_level_data.csv'  # File name

         # CSV headers
         csv_headers = ['Line #', 'GW Log Id', 'GW Site Id', 'Land Surface Elevation', 'Record Date', 'Restrict to OWRD only', 'Reviewed Status', 'Reviewed Status Description', 'Water level ft above mean sea level', 'Water level ft below land surface']
         list_of_water_readings = []

         # Dynamic params
         WELL_NO = 'HARN0052657'
         START_DATE = '1/1/1905'
         END_DATE = '12/30/2050'

         # API URL
         URL = 'https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/' + WELL_NO + '/gw_recorder_water_level_daily_mean_public/?start_date=' + START_DATE + '&end_date=' + END_DATE + '&reviewed_status=&restrict_to_owrd_only=n'

         response = requests.get(URL, verify=False)  # GET API call
         json_result = json.loads(response.text)  # Parse the JSON response

         print('Daily water level data count ', json_result['feature_count'])  # Print the number of records
         extracted_data = json_result['feature_list']  # Extracted data as a list of dicts

         for idx, item in enumerate(extracted_data):  # Iterate over the list of extracted data
             list_of_water_readings.append({  # Build a list of rows keyed by the CSV headers
                 'Line #': idx + 1,
                 'GW Log Id': item['gw_logid'],
                 'GW Site Id': item['gw_site_id'],
                 'Land Surface Elevation': item['land_surface_elevation'],
                 'Record Date': item['record_date'],
                 'Restrict to OWRD only': item['restrict_to_owrd_only'],
                 'Reviewed Status': item['reviewed_status'],
                 'Reviewed Status Description': item['reviewed_status_description'],
                 'Water level ft above mean sea level': item['waterlevel_ft_above_mean_sea_level'],
                 'Water level ft below land surface': item['waterlevel_ft_below_land_surface']
             })

         # Create the CSV and write the data into it
         with open(file_path + file_name, 'a+') as daily_water_level_data_CSV:  # Open file in a+ mode
             csvwriter = csv.DictWriter(daily_water_level_data_CSV, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
             print('Writing CSV header now...')
             csvwriter.writeheader()  # Write headers to the CSV file
             for item in list_of_water_readings:  # Iterate over the collected rows and save them to the CSV file
                 print('Writing data rows now..')
                 print(item)
                 csvwriter.writerow(item)

     scrap_daily_water_level()

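Once the CSV is written, you can load it straight back into pandas for the graphing/comparison you mentioned. A minimal sketch, assuming the default file name above (the API's date format is my assumption, so adjust the parsing if needed):

     import pandas as pd
     import matplotlib.pyplot as plt

     # Load the CSV produced by the script above
     df = pd.read_csv('daily_water_level_data.csv', parse_dates=['Record Date'])
     df = df.sort_values('Record Date')

     # Plot the daily water level for comparison against modeled data
     df.plot(x='Record Date', y='Water level ft below land surface')
     plt.show()
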
Upvotes: 1
