I want to iterate through a sequence of years and capture each year's output in one large data frame. The API only allows requesting one year of data at a time, so I thought I could run a loop like the one below and accumulate the results into an initially empty data frame. This seems to work, but I was wondering if there is a more concise way of achieving this.
In case anyone is interested, API info and signup: https://aqs.epa.gov/aqsweb/documents/data_api.html#bdate
library("jsonlite")
library(lubridate)
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
years <- as.character(2011:2019)
dat = {}
for (year in years) {
my_raw_result <- httr::GET(base_url_site,
query = list(email="[email protected]",key=Sys.getenv("AQS_KEY"),
param = "44201",
bdate=paste(year,"0101",sep = ""),
edate=paste(year,"1231",sep = ""),state="48",
county="141", site="0055"))
my_content <- httr::content(my_raw_result, as = 'text')
my_content_from_json <- fromJSON(my_content)
df <- my_content_from_json$Data
dat = rbind(dat,df)
}
A slightly more efficient solution can be obtained by calling rbind() only once, after the loop, rather than iteratively inside it. We can do this with a combination of Base R and lapply(). The key change needed to make the code work was converting the list output of fromJSON() into a data frame; here fromJSON() is the RJSONIO version, which returns nested lists rather than a data frame, and that conversion did not work correctly in the code posted with the original question.
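As a quick illustration of the difference between the two fromJSON() implementations (using a toy JSON string standing in for the EPA payload, purely an assumption for demonstration):

json <- '{"Data":[{"a":1,"b":"x"},{"a":2,"b":"y"}]}'
str(RJSONIO::fromJSON(json)$Data)   # a list of two named lists -- needs conversion
str(jsonlite::fromJSON(json)$Data)  # already a data frame with columns a and b

With that in mind, the solution: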
# set private key
Sys.setenv(AQS_KEY = "yourKeyGoesHere")

base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
library(RJSONIO)
library(tidyr)

years <- as.character(2011:2019)
system.time(dfList <- lapply(years, function(year) {
  # request one year of ozone (parameter 44201) data for site 48-141-0055
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "[email protected]",
                                          key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48",
                                          county = "141",
                                          site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)  # RJSONIO: a list of lists
  # convert the list of lists into a data frame of list columns
  df <- data.frame(t(sapply(my_content_from_json$Data, c)))
  df$uncertainty <- " "  # blank out NULLs so unnest() does not fail
  tidyr::unnest(df, cols = colnames(df))  # unnest & return to parent
}))
system.time(combinedData <- do.call(rbind, dfList))
The code to extract years 2011-2019 from the EPA database runs in about 46.8 seconds of user time, including the initial extracts, unnesting of each resulting data structure, and the one-time combination of data frames at the end.
user system elapsed
46.670 0.756 71.432
> system.time(combinedData <- data.frame(do.call(rbind,dfList)))
user system elapsed
0.096 0.027 0.123
The large difference between user time and elapsed time is likely due to wait times to receive data from the API.
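Given that most of the elapsed time is spent waiting on the API, it may also be worth failing fast when a request goes wrong. A minimal sketch (an addition, not part of the measured solution) using httr's stop_for_status(); the query simply repeats the 2011 request from the code above:

# Sketch: abort with an informative error on a 4xx/5xx response, instead of
# handing an HTML error page to fromJSON().
my_raw_result <- httr::GET(base_url_site,
                           query = list(email = "[email protected]",
                                        key = Sys.getenv("AQS_KEY"),
                                        param = "44201",
                                        bdate = "20110101",
                                        edate = "20111231",
                                        state = "48",
                                        county = "141",
                                        site = "0055"))
httr::stop_for_status(my_raw_result)  # no-op on success, error otherwise
my_content <- httr::content(my_raw_result, as = 'text')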
A key feature of this solution is the technique used to convert the list of lists into data frame rows (h/t Alex Brown's answer to Convert a List to a Data Frame), combined with unnesting of the resulting data structure via tidyr::unnest(). We also had to set the uncertainty column to blank, because unnest() fails on the NULL values extracted from the EPA API.
df <- data.frame(t(sapply(my_content_from_json$Data, c)))
df$uncertainty <- " "  # blank out NULLs so unnest() does not fail
tidyr::unnest(df, cols = colnames(df))  # unnest & return to parent
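A toy example (hypothetical records, not EPA data) makes the conversion easier to follow:

# Each element of rows mimics one record from the parsed JSON.
rows <- list(list(id = 1, val = "a"), list(id = 2, val = "b"))
df <- data.frame(t(sapply(rows, c)))    # a data frame of list columns
str(df)                                 # id and val are still lists here
tidyr::unnest(df, cols = colnames(df))  # flatten to ordinary atomic columns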
Output from the combined data frame looks like this.
> head(combinedData)
state_code county_code site_number parameter_code poc latitude longitude datum
1 48 141 0055 44201 1 31.74677 -106.4028 WGS84
2 48 141 0055 44201 1 31.74677 -106.4028 WGS84
3 48 141 0055 44201 1 31.74677 -106.4028 WGS84
4 48 141 0055 44201 1 31.74677 -106.4028 WGS84
5 48 141 0055 44201 1 31.74677 -106.4028 WGS84
6 48 141 0055 44201 1 31.74677 -106.4028 WGS84
parameter date_local time_local date_gmt time_gmt sample_measurement
1 Ozone 2011-12-31 23:00 2012-01-01 06:00 0.023
2 Ozone 2011-12-31 22:00 2012-01-01 05:00 NA
3 Ozone 2011-12-31 21:00 2012-01-01 04:00 NA
4 Ozone 2011-12-31 20:00 2012-01-01 03:00 0.018
5 Ozone 2011-12-31 19:00 2012-01-01 02:00 0.006
6 Ozone 2011-12-31 18:00 2012-01-01 01:00 0.002
units_of_measure units_of_measure_code sample_duration sample_duration_code
1 Parts per million 007 1 HOUR 1
2 Parts per million 007 1 HOUR 1
3 Parts per million 007 1 HOUR 1
4 Parts per million 007 1 HOUR 1
5 Parts per million 007 1 HOUR 1
6 Parts per million 007 1 HOUR 1
sample_frequency detection_limit uncertainty
1 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
2 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
3 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
4 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
5 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
6 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
qualifier method_type method
1 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
2 BF - Precision/Zero/Span. FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
3 BF - Precision/Zero/Span. FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
4 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
5 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
6 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
method_code state county date_of_last_change cbsa_code
1 087 Texas El Paso 2012-01-23 21340
2 087 Texas El Paso 2012-01-23 21340
3 087 Texas El Paso 2012-01-23 21340
4 087 Texas El Paso 2012-01-23 21340
5 087 Texas El Paso 2012-01-23 21340
6 087 Texas El Paso 2012-01-23 21340
The original code, updated to produce a data frame without nested lists, runs in about 43.6 seconds, roughly 3 seconds faster than the lapply() version, which is a bit surprising. With only nine iterations, though, the usual copying overhead of growing a data frame via rbind() in a loop barely registers, and run-to-run variation in network wait times can easily swamp a 3-second difference.
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
years <- as.character(2011:2019)

dat <- NULL  # accumulator for the combined results
system.time(for (year in years) {
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "[email protected]",
                                          key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48",
                                          county = "141",
                                          site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)
  dataList <- my_content_from_json$Data
  # convert to a data frame first, then drop the uncertainty column
  # (all NULLs, which unnest() cannot handle)
  df <- data.frame(t(sapply(dataList, c)))
  df <- df[!(colnames(df) == "uncertainty")]
  unnestedDf <- tidyr::unnest(df, cols = colnames(df))
  dat <- rbind(dat, unnestedDf)
})
...and the runtime stats, which show the same pattern of elapsed time relative to user time:
user system elapsed
43.586 0.686 66.604
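Since both versions repeat the same request/parse/convert steps, one further tightening (a sketch under the same assumptions as above; fetch_year is a hypothetical helper name, not part of either original) would be to factor those steps into a single function:

# Hypothetical helper: fetch, parse, and flatten one year of data.
fetch_year <- function(year) {
  raw <- httr::GET(base_url_site,
                   query = list(email = "[email protected]",
                                key = Sys.getenv("AQS_KEY"),
                                param = "44201",
                                bdate = paste(year, "0101", sep = ""),
                                edate = paste(year, "1231", sep = ""),
                                state = "48",
                                county = "141",
                                site = "0055"))
  parsed <- fromJSON(httr::content(raw, as = 'text'))
  df <- data.frame(t(sapply(parsed$Data, c)))
  df$uncertainty <- " "  # blank out NULLs so unnest() succeeds
  tidyr::unnest(df, cols = colnames(df))
}
combinedData <- do.call(rbind, lapply(years, fetch_year))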