Dr Wampa
Dr Wampa

Reputation: 453

Parsing nested JSON from one row and one column in R

Here is an example of one row of my data frame:

test <- structure(list(index = 1,
                       id = 100,
                       creation_date = "2020-11-11 11:38:00",
                       json_data = "{'alldata': {'values': {'set': {'min': 60, 'max': 190}, 'value': [[112, 111, 89, 78, 178, 90, 80, 56, 80, 95, 120, 140, 160, 156, 100, 77, 90, 89, 100, 112, 111, 89, 78, 178, 90, 80]], 'optimal': {'min': 40, 'max': 190}, 'feature': 'postprandial after 3pm', 'type': {'timerange': 'Yes 3PM-7PM', 'hrs': '4Hr'}}, 'name': 'insight.A'}}"
))

You can see the data is quite complex because there are a ton of numbers in the "values" list within the json_data. I would like to extract all of this information into a data frame while keeping the non-jason_data columns as well.

Ideal output:

df <- data.frame(index = rep(1, each = 26),
                 id = rep(100, each = 26),
                 creation_date = rep("2020-11-11 11:38:00", each = 26),
                 values_set_min = rep(60, each = 26),
                 values_set_max = rep(190, each = 26),
                 value = c(112, 111, 89, 78, 178, 90, 80, 56, 80, 95, 120, 140, 160, 156, 100, 77, 90, 89, 100, 112, 111, 89, 78, 178, 90, 80),
                 optimal_min = rep(40, each = 26),
                 optimal_max = rep(190, each = 26),
                 feature = rep("Postprandial after 3PM", each = 26),
                 type_timerange = rep("Yes 3PM-7PM", each = 26),
                 type_hrs = rep("4HR", each = 26),
                 name = rep("insight.A", each = 26)
                                     )

head(df)

index  id       creation_date values_set_min values_set_max value optimal_min optimal_max                feature type_timerange type_hrs      name
1     1 100 2020-11-11 11:38:00             60            190   112          40         190 Postprandial after 3PM    Yes 3PM-7PM      4HR insight.A
2     1 100 2020-11-11 11:38:00             60            190   111          40         190 Postprandial after 3PM    Yes 3PM-7PM      4HR insight.A
3     1 100 2020-11-11 11:38:00             60            190    89          40         190 Postprandial after 3PM    Yes 3PM-7PM      4HR insight.A
4     1 100 2020-11-11 11:38:00             60            190    78          40         190 Postprandial after 3PM    Yes 3PM-7PM      4HR insight.A
5     1 100 2020-11-11 11:38:00             60            190   178          40         190 Postprandial after 3PM    Yes 3PM-7PM      4HR insight.A
6     1 100 2020-11-11 11:38:00             60            190    90          40         190 Postprandial after 3PM    Yes 3PM-7PM      4HR insight.A

I have tried jsonlite etc but they always lose the non-JSON columns such as creation_date, which I need. Any help would be much appreciated. Thanks in advance.

Upvotes: 2

Views: 165

Answers (1)

akrun
akrun

Reputation: 887118

We may need to replace the ' with " using gsub. THen flatten the list recursively on the 'json_data' element, stack the named list into two column dataset, reshape to 'wide' format with pivot_wider and fill the missing elements with non-NA adjacent element

library(jsonlite)
library(dplyr)
library(purrr)
library(data.table)
library(tidyr)
rrapply::rrapply(fromJSON(gsub("'", '"', test$json_data), 
      flatten = TRUE), f = unlist, how = 'flatten') %>% 
    append(test[-length(test)], .) %>% 
    map(as.vector) %>% 
    stack %>% 
    mutate(rn = rowid(ind)) %>%
    pivot_wider(names_from = ind, values_from = values) %>% 
    select(-rn) %>% 
    fill(everything()) %>% 
    type.convert(as.is = TRUE)

-output

# A tibble: 26 x 10
   index    id creation_date         min   max value feature                timerange   hrs   name     
   <int> <int> <chr>               <int> <int> <int> <chr>                  <chr>       <chr> <chr>    
 1     1   100 2020-11-11 11:38:00    60   190   112 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 2     1   100 2020-11-11 11:38:00    40   190   111 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 3     1   100 2020-11-11 11:38:00    40   190    89 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 4     1   100 2020-11-11 11:38:00    40   190    78 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 5     1   100 2020-11-11 11:38:00    40   190   178 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 6     1   100 2020-11-11 11:38:00    40   190    90 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 7     1   100 2020-11-11 11:38:00    40   190    80 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 8     1   100 2020-11-11 11:38:00    40   190    56 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 9     1   100 2020-11-11 11:38:00    40   190    80 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
10     1   100 2020-11-11 11:38:00    40   190    95 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
# … with 16 more rows

Or this can be also done as

library(tibble)
library(stringr)
as_tibble(test) %>%
    mutate(json_data = list(str_replace_all(json_data, "'", '"') %>% 
             fromJSON(flatten = TRUE) %>%
             rrapply::rrapply(f = unlist, how = 'flatten') %>% 
             map(as.vector))) %>%
    unnest_wider(json_data) %>%
    unnest(value)

-output

# A tibble: 26 x 12
   index    id creation_date       min...1 max...2 value min...4 max...5 feature                timerange   hrs   name     
   <dbl> <dbl> <chr>                 <int>   <int> <int>   <int>   <int> <chr>                  <chr>       <chr> <chr>    
 1     1   100 2020-11-11 11:38:00      60     190   112      40     190 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 2     1   100 2020-11-11 11:38:00      60     190   111      40     190 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 3     1   100 2020-11-11 11:38:00      60     190    89      40     190 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 4     1   100 2020-11-11 11:38:00      60     190    78      40     190 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 5     1   100 2020-11-11 11:38:00      60     190   178      40     190 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 6     1   100 2020-11-11 11:38:00      60     190    90      40     190 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 7     1   100 2020-11-11 11:38:00      60     190    80      40     190 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 8     1   100 2020-11-11 11:38:00      60     190    56      40     190 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
 9     1   100 2020-11-11 11:38:00      60     190    80      40     190 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
10     1   100 2020-11-11 11:38:00      60     190    95      40     190 postprandial after 3pm Yes 3PM-7PM 4Hr   insight.A
# … with 16 more rows

Upvotes: 2

Related Questions