Eric
Eric

Reputation: 73

Importing nested JSON structure in R

I'm trying to import some data from an API that has an unusual nested JSON structure. Here's a snippet of what the JSON data looks like:

{"series":[{"series_id":"NG.N9010US2.M","name":"U.S. Natural Gas Gross Withdrawals, Monthly","units":"Million Cubic Feet","f":"M","unitsshort":"MMcf","description":"U.S. Natural Gas Gross Withdrawals","copyright":"None","source":"EIA, U.S. Energy Information Administration","iso3166":"USA","geography":"USA","start":"197301","end":"202112","updated":"2022-02-28T17:53:37-0500","data":[["202112",3683216],["202111",3551244],["202110",3595255],["202109",3412770],["202108",3530753],["202107",3490995],["202106",3391187],["202105",3510356],["202104",3408930],["202103",3482323],["202102",2923896],["202101",3506258],["202012",3489506],["202011",3352374],["202010",3363756],["202009",3265272],["202008",3349817],["202007",3373594],["202006",3217496],["202005",3285105],["202004",3374224],["202003",3582414],["202002",3363434],["202001",3596775],["201912",3622512],["201911",3509293],["201910",3551970],["201909",3396571],["201908",3448106],["201907",3395984],["201906",3299794],["201905",3424018],["201904",3315270],["201903",3383042],["201902",3056916],["201901",3376735],["201812",3391473],["201811",3259912],["201810",3314676],["201809",3153978],["201808",3206177],["201807",3138242],["201806",2972738],["201805",3098558],["201804",2976108],["201803",3069284],["201802",2751215],["201801",2993178],["201712",3025419],["201711",2897003],["201710",2911373],["201709",2763285],["201708",2769670],["201707",2753997],["201706",2688422],["201705",2778965],["201704",2689967],["201703",2797117],["201702",2493267],["201701",2723628]]}

I'm not very familiar with different flavors of JSON. As you can see, the data is nested within the series information, but it's not in the typical paired format. I have been able to import the data as a nested list using fromJSON in the jsonlite package, but I'm stuck on how to convert it into a data frame. I would like the series information to be maintained with a new row for each pair of data observations.

I haven't gotten very far with this, so my code so far is pretty simple. I was able to extract the data series as a list using the $ operator:

gas <- fromJSON("gas production.json", flatten=TRUE)
gas$series$data

Thanks!

Upvotes: 2

Views: 343

Answers (1)

Allan Cameron
Allan Cameron

Reputation: 173793

From looking at your data structure, you could do something like this:

result <- jsonlite::fromJSON("gas_production.json")

df <- do.call(rbind, lapply(result$series[[1]]$data, 
              function(x) setNames(as.data.frame(x), c("date", "value"))))

df$date <- as.Date(paste(substr(df$date, 1, 4), substr(df$date, 5, 6), "01", sep = "-"))

Now the data frame contains two columns, date and value

head(df)
        date   value
1 2021-12-01 3683216
2 2021-11-01 3551244
3 2021-10-01 3595255
4 2021-09-01 3412770
5 2021-08-01 3530753
6 2021-07-01 3490995

We could use this for example to plot the series:

library(ggplot2)

ggplot(df, aes(date, value)) +
  geom_line(color = "deepskyblue4") +
  labs(title = "Monthly U.S. Natural Gas Gross Withdrawals",
       x = "Date", y = "Millions of cubic feet") +
  theme_minimal()

enter image description here


EDIT

Here's a full reprex demonstrating the way to get all series in a single data frame:

url <- paste0("https://raw.githubusercontent.com/shrubberr/",
              "energy-data/main/gas%20production.json")

result <- jsonlite::parse_json(httr::content(httr::GET(url), "text"))

df <- do.call(rbind, lapply(result$series, function(d) {
  y <- do.call(rbind, lapply(d$data, 
       function(x) {
         if(is.null(x[[2]])) x[[2]] <- NA_real_; 
       setNames(as.data.frame(x), c("Date", "Value"))}))
  y$Name <- d$name
  y
}))

df$Date <- as.Date(paste(substr(df$Date, 1, 4), 
                         substr(df$Date, 5, 6), "01", sep = "-"))  

library(ggplot2)

ggplot(df, aes(Date, Value, color = Name)) +
  geom_line() +
  labs(title = "Natural Gas Gross Withdrawals",
       x = "Date", y = "Millions of cubic feet", color = "") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.direction = "vertical")
#> Warning: Removed 84 row(s) containing missing values (geom_path).

Created on 2022-03-30 by the reprex package (v2.0.1)

Upvotes: 2

Related Questions