kim
kim

Reputation: 567

How to append json data into dataframe and update its value in R?

I want to append json data to dataframe by looking at whether it does have common key or not, if yes, I want to do value update. So far, I used jsonlite::fromJSON() to get R list from raw json data, can add json data as new row there is no match with dataframe. I am wondering how am I gonna add value from json data if it has common key with dataframe. Any way to do this task in R?

reproducible example

here is the output of jsonlite::fromJSON(jsondata):

mylist = list(version = 0L, subject = "Subject 102", organization = "HOSP", 
    time = "2018-02-10T09:30:00Z", features = list(age = 81L, 
        gender = "M", il6 = list(value = 200L, time = "2018-02-10T09:30:00Z"), 
        pct = list(value = 1234L, time = "2018-02-10T09:30:00Z"), 
        crp = list(value = 1004L, time = "2018-02-10T09:30:00Z")))

here is dataframe:

mydf = structure(list(subject= c('Subject 100', 'Subject 101','Subject 102','Subject 103'), taken_time = c(15L, 5L, 39L, -21L), il6= c(203.2893857, 
4.858269406, 4.858269406, 14220), pct = c(732.074484, 25.67901235, 
119.0221384, 120.0477168), crp = c(196115424.7, 1073624.455, 
4092823.885, 1457496.474), age = c(52L, 74L, 81L, 67L), gender = structure(c(2L, 
2L, 2L, 1L), .Label = c("F", "M"), class = "factor"), inpatient_readmission_time_rtd = c(79.78819444, 
57.59068053, 57.59068053, 57.59068053), infection_flag = c(0L, 
0L, 1L, 1L), temperature_value = c(98.9, 98.9, 98, 101.3), heartrate_value = c(106, 
61, 78, 91), pH_result_time_rta = c(11, 85.50402145, 85.50402145, 
85.50402145), gcst_value = c(15, 15, 15, 14.63769293)), row.names = c(NA, 
4L), class = "data.frame")

current attempt:

based on @akrun suggestion, here is what I tried:

unlist(mylist) %>% setNames(str_remove(names(.), 'features\\.')) %>% 
  as.list %>% as_tibble %>% type.convert(as.is = TRUE) %>% 
  bind_rows(mydf, .) %>% select(-contains(c(".time", "version", "organization", "taken_time"))) %>% as.data.frame()

but I still can't able to add value in json data to dataframe by finding common key such as person_id, crp, pct, il6, then I just want to do value update. How can I do that easily in R? any idea?

I think using dplyr can do this manipulation, but I don't know how to get going from above attempt. any idea?

desired output:

this is the desired output that I want to get:

structure(list(person_id = structure(1:4, .Label = c("Subject 100", 
"Subject 101", "Subject 102", "Subject 103"), class = "factor"), 
    taken_time = c(15L, 5L, 39L, -21L), il6 = c(203.2893857, 
    4.858269406, 200, 14220), pct = c(732.074484, 25.67901235, 
    1234, 120.0477168), crp = c(196115424.7, 1073624.455, 1004, 
    1457496.474), age = c(52L, 74L, 81L, 67L), gender = structure(c(2L, 
    2L, 2L, 1L), .Label = c("F", "M"), class = "factor"), inpatient_readmission_time_rtd = c(79.78819444, 
    57.59068053, 57.59068053, 57.59068053), infection_flag = c(0L, 
    0L, 1L, 1L), temperature_value = c(98.9, 98.9, 98, 101.3), 
    heartrate_value = c(106L, 61L, 78L, 91L), pH_result_time_rta = c(11, 
    85.50402145, 85.50402145, 85.50402145), gcst_value = c(15, 
    15, 15, 14.63769293)), class = "data.frame", row.names = c(NA, 
-4L))

how can I get above desired dataframe? Is there any way to achieve this by using dplyr utilities? thanks

update: R sessioninfo:

here is the output of R sessioninfo:

> > sessionInfo() R version 3.6.0 (2019-04-26) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build
> 18362)
> 
> Matrix products: default
> 
> Random number generation:  RNG:     Mersenne-Twister   Normal: 
> Inversion   Sample:  Rounding    locale: [1] LC_COLLATE=English_United
> States.1252  LC_CTYPE=English_United States.1252    [3]
> LC_MONETARY=English_United States.1252 LC_NUMERIC=C                   
> [5] LC_TIME=English_United States.1252    
> 
> attached base packages: [1] stats     graphics  grDevices utils    
> datasets  methods   base     
> 
> other attached packages: [1] stringr_1.4.0   ranger_0.12.1  
> dplyr_0.8.5     caret_6.0-86    ggplot2_3.3.0   lattice_0.20-38 [7]
> jsonlite_1.6.1 
> 
> loaded via a namespace (and not attached):  [1] Rcpp_1.0.3          
> pillar_1.4.3         compiler_3.6.0       gower_0.2.1         
> plyr_1.8.6            [6] iterators_1.0.12     class_7.3-15        
> tools_3.6.0          rpart_4.1-15         ipred_0.9-9          [11]
> lubridate_1.7.4      lifecycle_0.2.0      tibble_2.1.3        
> nlme_3.1-139         gtable_0.3.0         [16] pkgconfig_2.0.3     
> rlang_0.4.5          Matrix_1.2-17        foreach_1.5.0       
> rstudioapi_0.11      [21] prodlim_2019.11.13   withr_2.1.2         
> pROC_1.16.2          generics_0.0.2       recipes_0.1.10       [26]
> stats4_3.6.0         nnet_7.3-12          grid_3.6.0          
> tidyselect_1.0.0     data.table_1.12.8    [31] glue_1.3.2          
> R6_2.4.1             survival_2.44-1.1    lava_1.6.7          
> reshape2_1.4.3       [36] purrr_0.3.3          magrittr_1.5        
> ModelMetrics_1.2.2.2 splines_3.6.0        scales_1.1.0         [41]
> codetools_0.2-16     MASS_7.3-51.4        assertthat_0.2.1    
> timeDate_3043.102    colorspace_1.4-1     [46] stringi_1.4.6       
> munsell_0.5.0        crayon_1.3.4

Upvotes: 1

Views: 241

Answers (1)

akrun
akrun

Reputation: 887511

We can use data.table to update. unlist the 'mylist', change the type of the elements, replace the names of the 'un1' by removing the prefix 'features.', create two vector of column names that are similar with the 'mydf' ('nm1') and different ('nm2'). Update the 'mydf' by joining on the 'subject' and assigning (:=) the values from the 'un1'

library(data.table)
library(stringr)
un1 <- type.convert(as.list(unlist(mylist)), as.is = TRUE)
names(un1) <- str_remove(names(un1), 'features\\.')
nm1 <- setdiff(intersect(names(mydf), names(un1)), 'subject')
nm2 <- setdiff(names(un1), names(mydf))
setDT(mydf)[as.data.table(un1), (nm1) := mget(paste0("i.", nm1)), on = .(subject)]
mydf[as.data.table(un1), (nm2) := mget(nm2), on = .(subject)]
mydf
#       subject taken_time          il6       pct       crp age gender inpatient_readmission_time_rtd
#1: Subject 100         15   203.289386 732.07448 196115425  52      M                       79.78819
#2: Subject 101          5     4.858269  25.67901   1073624  74      M                       57.59068
#3: Subject 102         39     4.858269 119.02214   4092824  81      M                       57.59068
#4: Subject 103        -21 14220.000000 120.04772   1457496  67      F                       57.59068
#   infection_flag temperature_value heartrate_value pH_result_time_rta gcst_value version organization
#1:              0              98.9             106           11.00000   15.00000      NA         <NA>
#2:              0              98.9              61           85.50402   15.00000      NA         <NA>
#3:              1              98.0              78           85.50402   15.00000       0         HOSP
#4:              1             101.3              91           85.50402   14.63769      NA         <NA>
#                   time il6.value             il6.time pct.value             pct.time crp.value
#1:                 <NA>        NA                 <NA>        NA                 <NA>        NA
#2:                 <NA>        NA                 <NA>        NA                 <NA>        NA
#3: 2018-02-10T09:30:00Z       200 2018-02-10T09:30:00Z      1234 2018-02-10T09:30:00Z      1004
#4:                 <NA>        NA                 <NA>        NA                 <NA>        NA
#               crp.time
#1:                 <NA>
#2:                 <NA>
#3: 2018-02-10T09:30:00Z
#4:                 <NA>

Upvotes: 2

Related Questions