Reputation: 453
Here is an example of one row of my data frame:
test <- structure(list(index = 1,
id = 100,
creation_date = "2020-11-11 11:38:00",
json_data = "{'alldata': {'values': {'set': {'min': 60, 'max': 190}, 'value': [[112, 111, 89, 78, 178, 90, 80, 56, 80, 95, 120, 140, 160, 156, 100, 77, 90, 89, 100, 112, 111, 89, 78, 178, 90, 80]], 'optimal': {'min': 40, 'max': 190}, 'feature': 'postprandial after 3pm', 'type': {'timerange': 'Yes 3PM-7PM', 'hrs': '4Hr'}}, 'name': 'insight.A'}}"
))
You can see the data is quite complex because there are a ton of numbers in the "values" list within the json_data. I would like to extract all of this information into a data frame while keeping the non-jason_data columns as well.
Ideal output:
df <- data.frame(index = rep(1, each = 26),
id = rep(100, each = 26),
creation_date = rep("2020-11-11 11:38:00", each = 26),
values_set_min = rep(60, each = 26),
values_set_max = rep(190, each = 26),
value = c(112, 111, 89, 78, 178, 90, 80, 56, 80, 95, 120, 140, 160, 156, 100, 77, 90, 89, 100, 112, 111, 89, 78, 178, 90, 80),
optimal_min = rep(40, each = 26),
optimal_max = rep(190, each = 26),
feature = rep("Postprandial after 3PM", each = 26),
type_timerange = rep("Yes 3PM-7PM", each = 26),
type_hrs = rep("4HR", each = 26),
name = rep("insight.A", each = 26)
)
head(df)
index id creation_date values_set_min values_set_max value optimal_min optimal_max feature type_timerange type_hrs name
1 1 100 2020-11-11 11:38:00 60 190 112 40 190 Postprandial after 3PM Yes 3PM-7PM 4HR insight.A
2 1 100 2020-11-11 11:38:00 60 190 111 40 190 Postprandial after 3PM Yes 3PM-7PM 4HR insight.A
3 1 100 2020-11-11 11:38:00 60 190 89 40 190 Postprandial after 3PM Yes 3PM-7PM 4HR insight.A
4 1 100 2020-11-11 11:38:00 60 190 78 40 190 Postprandial after 3PM Yes 3PM-7PM 4HR insight.A
5 1 100 2020-11-11 11:38:00 60 190 178 40 190 Postprandial after 3PM Yes 3PM-7PM 4HR insight.A
6 1 100 2020-11-11 11:38:00 60 190 90 40 190 Postprandial after 3PM Yes 3PM-7PM 4HR insight.A
I have tried jsonlite etc but they always lose the non-JSON columns such as creation_date, which I need. Any help would be much appreciated. Thanks in advance.
Upvotes: 2
Views: 165
Reputation: 887118
We may need to replace the '
with "
using gsub
. THen flatten
the list
recursively on the 'json_data' element, stack
the named list
into two column dataset, reshape to 'wide' format with pivot_wider
and fill
the missing elements with non-NA adjacent element
library(jsonlite)
library(dplyr)
library(purrr)
library(data.table)
library(tidyr)
rrapply::rrapply(fromJSON(gsub("'", '"', test$json_data),
flatten = TRUE), f = unlist, how = 'flatten') %>%
append(test[-length(test)], .) %>%
map(as.vector) %>%
stack %>%
mutate(rn = rowid(ind)) %>%
pivot_wider(names_from = ind, values_from = values) %>%
select(-rn) %>%
fill(everything()) %>%
type.convert(as.is = TRUE)
-output
# A tibble: 26 x 10
index id creation_date min max value feature timerange hrs name
<int> <int> <chr> <int> <int> <int> <chr> <chr> <chr> <chr>
1 1 100 2020-11-11 11:38:00 60 190 112 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
2 1 100 2020-11-11 11:38:00 40 190 111 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
3 1 100 2020-11-11 11:38:00 40 190 89 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
4 1 100 2020-11-11 11:38:00 40 190 78 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
5 1 100 2020-11-11 11:38:00 40 190 178 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
6 1 100 2020-11-11 11:38:00 40 190 90 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
7 1 100 2020-11-11 11:38:00 40 190 80 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
8 1 100 2020-11-11 11:38:00 40 190 56 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
9 1 100 2020-11-11 11:38:00 40 190 80 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
10 1 100 2020-11-11 11:38:00 40 190 95 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
# … with 16 more rows
Or this can be also done as
library(tibble)
library(stringr)
as_tibble(test) %>%
mutate(json_data = list(str_replace_all(json_data, "'", '"') %>%
fromJSON(flatten = TRUE) %>%
rrapply::rrapply(f = unlist, how = 'flatten') %>%
map(as.vector))) %>%
unnest_wider(json_data) %>%
unnest(value)
-output
# A tibble: 26 x 12
index id creation_date min...1 max...2 value min...4 max...5 feature timerange hrs name
<dbl> <dbl> <chr> <int> <int> <int> <int> <int> <chr> <chr> <chr> <chr>
1 1 100 2020-11-11 11:38:00 60 190 112 40 190 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
2 1 100 2020-11-11 11:38:00 60 190 111 40 190 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
3 1 100 2020-11-11 11:38:00 60 190 89 40 190 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
4 1 100 2020-11-11 11:38:00 60 190 78 40 190 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
5 1 100 2020-11-11 11:38:00 60 190 178 40 190 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
6 1 100 2020-11-11 11:38:00 60 190 90 40 190 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
7 1 100 2020-11-11 11:38:00 60 190 80 40 190 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
8 1 100 2020-11-11 11:38:00 60 190 56 40 190 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
9 1 100 2020-11-11 11:38:00 60 190 80 40 190 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
10 1 100 2020-11-11 11:38:00 60 190 95 40 190 postprandial after 3pm Yes 3PM-7PM 4Hr insight.A
# … with 16 more rows
Upvotes: 2