I want to iterate through a sequence of years and capture each year's output in one large data frame. The API only allows requesting one year of data at a time, so I thought I could run a loop like the one below and accumulate the results into an initially empty data frame. This seems to work, but I was wondering if there is a more concise way of achieving this.
In case anyone is interested, API info and signup: https://aqs.epa.gov/aqsweb/documents/data_api.html#bdate
library("jsonlite")
library(lubridate)
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
years <- as.character(2011:2019)
dat = {}
for (year in years) {
my_raw_result <- httr::GET(base_url_site,
query = list(email="[email protected]",key=Sys.getenv("AQS_KEY"),
param = "44201",
bdate=paste(year,"0101",sep = ""),
edate=paste(year,"1231",sep = ""),state="48",
county="141", site="0055"))
my_content <- httr::content(my_raw_result, as = 'text')
my_content_from_json <- fromJSON(my_content)
df <- my_content_from_json$Data
dat = rbind(dat,df)
}
A slightly more efficient solution can be obtained by calling rbind() only once, after the loop, rather than iteratively inside it. We can do this with a combination of Base R and lapply(). The key change needed to make the code work was converting the list output of fromJSON() into a data frame; here fromJSON() is the RJSONIO version, which returns nested lists rather than a data frame, and that conversion did not work correctly in the code posted with the original question.
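As a quick illustration of the difference between the two fromJSON() implementations (using a toy JSON string standing in for the EPA payload, purely an assumption for demonstration):

json <- '{"Data":[{"a":1,"b":"x"},{"a":2,"b":"y"}]}'
str(RJSONIO::fromJSON(json)$Data)   # a list of two named lists -- needs conversion
str(jsonlite::fromJSON(json)$Data)  # already a data frame with columns a and b

With that in mind, the solution: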
# set private key
Sys.setenv(AQS_KEY = "yourKeyGoesHere")

base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
library(RJSONIO)
library(tidyr)

years <- as.character(2011:2019)
system.time(dfList <- lapply(years, function(year) {
  # request one year of ozone (parameter 44201) data for site 48-141-0055
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "[email protected]",
                                          key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48",
                                          county = "141",
                                          site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)  # RJSONIO: a list of lists
  # convert the list of lists into a data frame of list columns
  df <- data.frame(t(sapply(my_content_from_json$Data, c)))
  df$uncertainty <- " "  # blank out NULLs so unnest() does not fail
  tidyr::unnest(df, cols = colnames(df))  # unnest & return to parent
}))
system.time(combinedData <- do.call(rbind, dfList))
The code to extract years 2011-2019 from the EPA database runs in about 46.8 seconds of user time, including the initial extracts, unnesting of each resulting data structure, and the one-time combination of data frames at the end.
user system elapsed
46.670 0.756 71.432
> system.time(combinedData <- data.frame(do.call(rbind,dfList)))
user system elapsed
0.096 0.027 0.123
The large difference between user time and elapsed time is likely due to wait times to receive data from the API.
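Given that most of the elapsed time is spent waiting on the API, it may also be worth failing fast when a request goes wrong. A minimal sketch (an addition, not part of the measured solution) using httr's stop_for_status(); the query simply repeats the 2011 request from the code above:

# Sketch: abort with an informative error on a 4xx/5xx response, instead of
# handing an HTML error page to fromJSON().
my_raw_result <- httr::GET(base_url_site,
                           query = list(email = "[email protected]",
                                        key = Sys.getenv("AQS_KEY"),
                                        param = "44201",
                                        bdate = "20110101",
                                        edate = "20111231",
                                        state = "48",
                                        county = "141",
                                        site = "0055"))
httr::stop_for_status(my_raw_result)  # no-op on success, error otherwise
my_content <- httr::content(my_raw_result, as = 'text')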
A key feature of this solution is the technique used to convert the list of lists into data frame rows (h/t Alex Brown's answer to Convert a List to a Data Frame), combined with unnesting of the resulting data structure via tidyr::unnest(). We also had to set the uncertainty column to blank, because unnest() fails on the NULL values extracted from the EPA API.
df <- data.frame(t(sapply(my_content_from_json$Data, c)))
df$uncertainty <- " "  # blank out NULLs so unnest() does not fail
tidyr::unnest(df, cols = colnames(df))  # unnest & return to parent
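A toy example (hypothetical records, not EPA data) makes the conversion easier to follow:

# Each element of rows mimics one record from the parsed JSON.
rows <- list(list(id = 1, val = "a"), list(id = 2, val = "b"))
df <- data.frame(t(sapply(rows, c)))    # a data frame of list columns
str(df)                                 # id and val are still lists here
tidyr::unnest(df, cols = colnames(df))  # flatten to ordinary atomic columns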
Output from the combined data frame looks like this.
> head(combinedData)
state_code county_code site_number parameter_code poc latitude longitude datum
1 48 141 0055 44201 1 31.74677 -106.4028 WGS84
2 48 141 0055 44201 1 31.74677 -106.4028 WGS84
3 48 141 0055 44201 1 31.74677 -106.4028 WGS84
4 48 141 0055 44201 1 31.74677 -106.4028 WGS84
5 48 141 0055 44201 1 31.74677 -106.4028 WGS84
6 48 141 0055 44201 1 31.74677 -106.4028 WGS84
parameter date_local time_local date_gmt time_gmt sample_measurement
1 Ozone 2011-12-31 23:00 2012-01-01 06:00 0.023
2 Ozone 2011-12-31 22:00 2012-01-01 05:00 NA
3 Ozone 2011-12-31 21:00 2012-01-01 04:00 NA
4 Ozone 2011-12-31 20:00 2012-01-01 03:00 0.018
5 Ozone 2011-12-31 19:00 2012-01-01 02:00 0.006
6 Ozone 2011-12-31 18:00 2012-01-01 01:00 0.002
units_of_measure units_of_measure_code sample_duration sample_duration_code
1 Parts per million 007 1 HOUR 1
2 Parts per million 007 1 HOUR 1
3 Parts per million 007 1 HOUR 1
4 Parts per million 007 1 HOUR 1
5 Parts per million 007 1 HOUR 1
6 Parts per million 007 1 HOUR 1
sample_frequency detection_limit uncertainty
1 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
2 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
3 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
4 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
5 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
6 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
qualifier method_type method
1 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
2 BF - Precision/Zero/Span. FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
3 BF - Precision/Zero/Span. FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
4 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
5 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
6 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
method_code state county date_of_last_change cbsa_code
1 087 Texas El Paso 2012-01-23 21340
2 087 Texas El Paso 2012-01-23 21340
3 087 Texas El Paso 2012-01-23 21340
4 087 Texas El Paso 2012-01-23 21340
5 087 Texas El Paso 2012-01-23 21340
6 087 Texas El Paso 2012-01-23 21340
The original code, updated to produce a data frame without nested lists, runs in about 43.6 seconds, roughly 3 seconds faster than the lapply() version, which is a bit surprising. With only nine iterations, though, the usual copying overhead of growing a data frame via rbind() in a loop barely registers, and run-to-run variation in network wait times can easily swamp a 3-second difference.
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
years <- as.character(2011:2019)

dat <- NULL  # accumulator for the combined results
system.time(for (year in years) {
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "[email protected]",
                                          key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48",
                                          county = "141",
                                          site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)
  dataList <- my_content_from_json$Data
  # convert to a data frame first, then drop the uncertainty column
  # (all NULLs, which unnest() cannot handle)
  df <- data.frame(t(sapply(dataList, c)))
  df <- df[!(colnames(df) == "uncertainty")]
  unnestedDf <- tidyr::unnest(df, cols = colnames(df))
  dat <- rbind(dat, unnestedDf)
})
...and the runtime stats, which show the same pattern of elapsed time relative to user time:
user system elapsed
43.586 0.686 66.604
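Since both versions repeat the same request/parse/convert steps, one further tightening (a sketch under the same assumptions as above; fetch_year is a hypothetical helper name, not part of either original) would be to factor those steps into a single function:

# Hypothetical helper: fetch, parse, and flatten one year of data.
fetch_year <- function(year) {
  raw <- httr::GET(base_url_site,
                   query = list(email = "[email protected]",
                                key = Sys.getenv("AQS_KEY"),
                                param = "44201",
                                bdate = paste(year, "0101", sep = ""),
                                edate = paste(year, "1231", sep = ""),
                                state = "48",
                                county = "141",
                                site = "0055"))
  parsed <- fromJSON(httr::content(raw, as = 'text'))
  df <- data.frame(t(sapply(parsed$Data, c)))
  df$uncertainty <- " "  # blank out NULLs so unnest() succeeds
  tidyr::unnest(df, cols = colnames(df))
}
combinedData <- do.call(rbind, lapply(years, fetch_year))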