Reputation: 567
I want to append json data to dataframe by looking at whether it does have common key or not, if yes, I want to do value update. So far, I used jsonlite::fromJSON()
to get R list from raw json data, can add json data as new row there is no match with dataframe. I am wondering how am I gonna add value from json data if it has common key with dataframe. Any way to do this task in R?
reproducible example
here is the output of jsonlite::fromJSON(jsondata)
:
mylist = list(version = 0L, subject = "Subject 102", organization = "HOSP",
time = "2018-02-10T09:30:00Z", features = list(age = 81L,
gender = "M", il6 = list(value = 200L, time = "2018-02-10T09:30:00Z"),
pct = list(value = 1234L, time = "2018-02-10T09:30:00Z"),
crp = list(value = 1004L, time = "2018-02-10T09:30:00Z")))
here is dataframe:
mydf = structure(list(subject= c('Subject 100', 'Subject 101','Subject 102','Subject 103'), taken_time = c(15L, 5L, 39L, -21L), il6= c(203.2893857,
4.858269406, 4.858269406, 14220), pct = c(732.074484, 25.67901235,
119.0221384, 120.0477168), crp = c(196115424.7, 1073624.455,
4092823.885, 1457496.474), age = c(52L, 74L, 81L, 67L), gender = structure(c(2L,
2L, 2L, 1L), .Label = c("F", "M"), class = "factor"), inpatient_readmission_time_rtd = c(79.78819444,
57.59068053, 57.59068053, 57.59068053), infection_flag = c(0L,
0L, 1L, 1L), temperature_value = c(98.9, 98.9, 98, 101.3), heartrate_value = c(106,
61, 78, 91), pH_result_time_rta = c(11, 85.50402145, 85.50402145,
85.50402145), gcst_value = c(15, 15, 15, 14.63769293)), row.names = c(NA,
4L), class = "data.frame")
current attempt:
based on @akrun suggestion, here is what I tried:
unlist(mylist) %>% setNames(str_remove(names(.), 'features\\.')) %>%
as.list %>% as_tibble %>% type.convert(as.is = TRUE) %>%
bind_rows(mydf, .) %>% select(-contains(c(".time", "version", "organization", "taken_time"))) %>% as.data.frame()
but I still can't able to add value in json data to dataframe by finding common key such as person_id
, crp
, pct
, il6
, then I just want to do value update. How can I do that easily in R? any idea?
I think using dplyr
can do this manipulation, but I don't know how to get going from above attempt. any idea?
desired output:
this is the desired output that I want to get:
structure(list(person_id = structure(1:4, .Label = c("Subject 100",
"Subject 101", "Subject 102", "Subject 103"), class = "factor"),
taken_time = c(15L, 5L, 39L, -21L), il6 = c(203.2893857,
4.858269406, 200, 14220), pct = c(732.074484, 25.67901235,
1234, 120.0477168), crp = c(196115424.7, 1073624.455, 1004,
1457496.474), age = c(52L, 74L, 81L, 67L), gender = structure(c(2L,
2L, 2L, 1L), .Label = c("F", "M"), class = "factor"), inpatient_readmission_time_rtd = c(79.78819444,
57.59068053, 57.59068053, 57.59068053), infection_flag = c(0L,
0L, 1L, 1L), temperature_value = c(98.9, 98.9, 98, 101.3),
heartrate_value = c(106L, 61L, 78L, 91L), pH_result_time_rta = c(11,
85.50402145, 85.50402145, 85.50402145), gcst_value = c(15,
15, 15, 14.63769293)), class = "data.frame", row.names = c(NA,
-4L))
how can I get above desired dataframe? Is there any way to achieve this by using dplyr
utilities? thanks
update: R sessioninfo:
here is the output of R sessioninfo:
> > sessionInfo() R version 3.6.0 (2019-04-26) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build
> 18362)
>
> Matrix products: default
>
> Random number generation: RNG: Mersenne-Twister Normal:
> Inversion Sample: Rounding locale: [1] LC_COLLATE=English_United
> States.1252 LC_CTYPE=English_United States.1252 [3]
> LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages: [1] stats graphics grDevices utils
> datasets methods base
>
> other attached packages: [1] stringr_1.4.0 ranger_0.12.1
> dplyr_0.8.5 caret_6.0-86 ggplot2_3.3.0 lattice_0.20-38 [7]
> jsonlite_1.6.1
>
> loaded via a namespace (and not attached): [1] Rcpp_1.0.3
> pillar_1.4.3 compiler_3.6.0 gower_0.2.1
> plyr_1.8.6 [6] iterators_1.0.12 class_7.3-15
> tools_3.6.0 rpart_4.1-15 ipred_0.9-9 [11]
> lubridate_1.7.4 lifecycle_0.2.0 tibble_2.1.3
> nlme_3.1-139 gtable_0.3.0 [16] pkgconfig_2.0.3
> rlang_0.4.5 Matrix_1.2-17 foreach_1.5.0
> rstudioapi_0.11 [21] prodlim_2019.11.13 withr_2.1.2
> pROC_1.16.2 generics_0.0.2 recipes_0.1.10 [26]
> stats4_3.6.0 nnet_7.3-12 grid_3.6.0
> tidyselect_1.0.0 data.table_1.12.8 [31] glue_1.3.2
> R6_2.4.1 survival_2.44-1.1 lava_1.6.7
> reshape2_1.4.3 [36] purrr_0.3.3 magrittr_1.5
> ModelMetrics_1.2.2.2 splines_3.6.0 scales_1.1.0 [41]
> codetools_0.2-16 MASS_7.3-51.4 assertthat_0.2.1
> timeDate_3043.102 colorspace_1.4-1 [46] stringi_1.4.6
> munsell_0.5.0 crayon_1.3.4
Upvotes: 1
Views: 241
Reputation: 887511
We can use data.table
to update. unlist
the 'mylist', change the type
of the elements, replace the names
of the 'un1' by removing the prefix 'features.', create two vector of column names that are similar with the 'mydf' ('nm1') and different ('nm2'). Update the 'mydf' by joining on
the 'subject' and assigning (:=
) the values from the 'un1'
library(data.table)
library(stringr)
un1 <- type.convert(as.list(unlist(mylist)), as.is = TRUE)
names(un1) <- str_remove(names(un1), 'features\\.')
nm1 <- setdiff(intersect(names(mydf), names(un1)), 'subject')
nm2 <- setdiff(names(un1), names(mydf))
setDT(mydf)[as.data.table(un1), (nm1) := mget(paste0("i.", nm1)), on = .(subject)]
mydf[as.data.table(un1), (nm2) := mget(nm2), on = .(subject)]
mydf
# subject taken_time il6 pct crp age gender inpatient_readmission_time_rtd
#1: Subject 100 15 203.289386 732.07448 196115425 52 M 79.78819
#2: Subject 101 5 4.858269 25.67901 1073624 74 M 57.59068
#3: Subject 102 39 4.858269 119.02214 4092824 81 M 57.59068
#4: Subject 103 -21 14220.000000 120.04772 1457496 67 F 57.59068
# infection_flag temperature_value heartrate_value pH_result_time_rta gcst_value version organization
#1: 0 98.9 106 11.00000 15.00000 NA <NA>
#2: 0 98.9 61 85.50402 15.00000 NA <NA>
#3: 1 98.0 78 85.50402 15.00000 0 HOSP
#4: 1 101.3 91 85.50402 14.63769 NA <NA>
# time il6.value il6.time pct.value pct.time crp.value
#1: <NA> NA <NA> NA <NA> NA
#2: <NA> NA <NA> NA <NA> NA
#3: 2018-02-10T09:30:00Z 200 2018-02-10T09:30:00Z 1234 2018-02-10T09:30:00Z 1004
#4: <NA> NA <NA> NA <NA> NA
# crp.time
#1: <NA>
#2: <NA>
#3: 2018-02-10T09:30:00Z
#4: <NA>
Upvotes: 2