Reputation: 45
I want to scrape data from NOAA (https://gml.noaa.gov/grad/solcalc/). The data I need is the sunrise and sunset times for various US counties over the last 3 years. I have the coordinates of those counties. My problem is that I don't know how to pass those coordinates and set the time frame to 3 years while scraping the site, so that I don't have to specify them manually each time.
I am using Python for scraping.
**I need the data in the following format:
latitude | longitude | year | month | day | sunrise | sunset**
I am new to programming. I tried the methods I found on the web, but none of them served my purpose.
Upvotes: 0
Views: 143
Reputation: 120559
You can use the `table.php` page to get your data and read it with pandas. This PHP script needs three parameters: `year`, `lat`, and `lon`.
from io import StringIO
import time

import pandas as pd
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0'
}

# Fill this table with your counties
counties = {
    'NY': {'lat': 40.72, 'lon': -74.02},
    'SF': {'lat': 37.77, 'lon': -122.42}
}

url = 'https://gml.noaa.gov/grad/solcalc/table.php'
dataset = []

for year in range(2020, 2023):
    for county, params in counties.items():
        print(year, county)
        payload = params | {'year': year}  # dict union needs Python 3.9+
        r = requests.get(url, headers=headers, params=payload)
        r.raise_for_status()  # stop early on an HTTP error
        # Wrap the HTML in StringIO: recent pandas versions deprecate
        # passing a literal HTML string to read_html
        dfs = pd.read_html(StringIO(r.text))

        # Reshape your data
        df = (pd.concat(dfs, keys=['Sunrise', 'Sunset', 'SolarNoon']).droplevel(1)
                .assign(Year=year, Lat=params['lat'], Lon=params['lon'])
                .set_index(['Lat', 'Lon', 'Year', 'Day'], append=True)
                .rename_axis(columns='Month').stack('Month')
                .unstack(level=0).reset_index())
        dataset.append(df)
        time.sleep(10)  # Wait at least 10 seconds not to be banned

out = pd.concat(dataset, ignore_index=True)
out.to_csv('solarcalc.csv', index=False)
Output:
        Lat      Lon  Year  Day  Month  SolarNoon  Sunrise  Sunset
0     40.72   -74.02  2020    1    Jan   11:59:16    07:20   16:39
1     40.72   -74.02  2020    1    Feb   12:09:33    07:07   17:13
2     40.72   -74.02  2020    1    Mar   12:08:22    06:29   17:48
3     40.72   -74.02  2020    1    Apr   12:59:52    06:39   19:21
4     40.72   -74.02  2020    1    May   12:53:10    05:54   19:53
...     ...      ...   ...  ...    ...        ...      ...     ...
2187  37.77  -122.42  2022   31    May   13:07:22    05:50   20:25
2188  37.77  -122.42  2022   31    Jul   13:16:06    06:12   20:19
2189  37.77  -122.42  2022   31    Aug   13:10:04    06:39   19:40
2190  37.77  -122.42  2022   31    Oct   12:53:15    07:34   18:12
2191  37.77  -122.42  2022   31    Dec   12:12:35    07:25   17:01

[2192 rows x 8 columns]
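The `counties` dict in the script is filled by hand, which gets tedious with many counties. Since you already have the coordinates, one option is to build the dict from a CSV file. This is only a sketch: the helper name `load_counties`, the column names `county`, `lat`, `lon`, and the labels `county_a` and `county_b` are assumptions for illustration, and the inline sample stands in for your real file.

```python
import csv
import io


def load_counties(fileobj):
    """Build {name: {'lat': float, 'lon': float}} from CSV rows.

    Assumes a header row with the (hypothetical) columns
    'county', 'lat' and 'lon'.
    """
    reader = csv.DictReader(fileobj)
    return {
        row['county']: {'lat': float(row['lat']), 'lon': float(row['lon'])}
        for row in reader
    }


# Inline sample standing in for a real file; the two rows reuse the
# coordinates from the script, under placeholder labels.
sample = io.StringIO(
    "county,lat,lon\n"
    "county_a,40.72,-74.02\n"
    "county_b,37.77,-122.42\n"
)
counties = load_counties(sample)
print(counties)
```

With a real file you would pass `open('counties.csv', newline='')` instead of the `StringIO` object (the file name is again just a placeholder). The resulting dict has the same shape as the hand-written one, so the download loop does not need to change.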
Note: if you prefer `Month` as a number, use:
month2num = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
             'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
out['Month'] = out['Month'].replace(month2num)
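To get exactly the layout asked for in the question (latitude, longitude, year, month, day, sunrise, sunset), the columns can be renamed, reordered, and sorted chronologically. The following is a sketch rather than a definitive recipe: it assumes `Month` still holds the three-letter abbreviations, and it uses two rows copied from the output above as a stand-in for the real `out` frame. The name `wanted` and the lowercase column names are my own choices.

```python
import pandas as pd

# Two rows copied from the printed output, standing in for `out`.
out = pd.DataFrame({
    'Lat': [40.72, 40.72],
    'Lon': [-74.02, -74.02],
    'Year': [2020, 2020],
    'Day': [1, 1],
    'Month': ['Feb', 'Jan'],
    'SolarNoon': ['12:09:33', '11:59:16'],
    'Sunrise': ['07:07', '07:20'],
    'Sunset': ['17:13', '16:39'],
})

month2num = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
             'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

columns = ['latitude', 'longitude', 'year', 'month', 'day', 'sunrise', 'sunset']

wanted = (
    out.assign(Month=out['Month'].map(month2num))  # abbreviation -> number
       .rename(columns={'Lat': 'latitude', 'Lon': 'longitude', 'Year': 'year',
                        'Month': 'month', 'Day': 'day',
                        'Sunrise': 'sunrise', 'Sunset': 'sunset'})
       .loc[:, columns]                            # drop SolarNoon, fix order
       .sort_values(columns[:5], ignore_index=True)  # chronological per location
)
print(wanted)
```

Sorting matters because the reshaped frame is ordered by day first and month second, so 1 Jan is followed by 1 Feb rather than 2 Jan.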
Upvotes: 0