How to deal with missing values in jsonlite?

Question

I am dealing with data from a repeated measure design. The data consists of 4 measurements, each measuring about 100 variables. One of these variables is a JSON-array containing the results of a reaction task. The structure of this array is basically like this: [[prime, answer, reaction time], [prime, answer, reaction time], ...]

Each array consists of about 80 trials. My aim is to convert this array to separate columns, so that it looks like the example below:

prime1    answer1     reaction_time1    prime2     answer2   reaction_time2 ...
picture8  2           2398              2          1         1856
picture8  1           798               1          2         712
...

When working with an example data set, I managed to convert the array to a dataframe using the following code:

reaction_data <- data.frame(example_data$ID, example_data$TP,
                       jsonlite::stream_in(textConnection(gsub("\n", "", example_data$reaction_raw))))

As mentioned above, I am now dealing with data from repeated measurements, arranged in long format. So, for each person ID I have four measurements TP and, ideally, a complete set of data for all of the 100 variables, including the JSON array. In reality, of course, I am dealing with drop-outs and missing values, which means the JSON array is also missing for some cases. Pretending my JSON array consisted of only 3 trials, my current dataframe looks more or less like the example data below (ignoring all the other variables):

ID       TP         reaction_raw
1        1          [[picture8, 2, 2398], [picture2, 1, 1856], [picture1, 1, 897]]
1        2          [[picture8, 1, 798], [picture2, 2, 712], [picture1, 1, 423]]
1        3          NA
1        4          [[picture8, 1, 1278], [picture2, 1, 1712], [picture1, 1, 902]]
2        1          [[picture8, 2, 2015], [picture2, 1, 3820], [picture1, 2, 2719]]
2        2          [[picture8, 2, 3219], [picture2, 2, 1920], [picture1, 1, 1298]]
2        3          NA
2        4          NA
3        1          [[picture8, 1, 209], [picture2, 1, 382], [picture1, 2, 891]]
3        2          NA
3        3          [[picture8, 2, 781], [picture2, 1, 291], [picture1, 1, 2039]]
3        4          NA
...

When I now run my code, I get the following error message:

lexical error: invalid char in json text.
                                       NA
                     (right here) ------^

I guess my code is not able to deal with the missing arrays. Does someone have an idea how to deal with this problem? Thank you in advance!

Allan Cameron · Accepted Answer

Assuming the quotation marks are simply missing from your example (otherwise the parser would fail at the first occurrence of picture8, well before reaching an NA), the way to handle this is to substitute a correctly formatted JSON string that parses to NA entries in the resulting data structure.
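To make the point about quoting concrete, here is a small check, assuming jsonlite is installed, that uses jsonlite::validate() to compare the unquoted text from the question, a quoted version, and the literal text NA from the error message. The variable names are illustrative only.

```r
library(jsonlite)

# Illustrative inputs (not taken from the real data set)
unquoted <- '[[picture8, 2, 2398], [picture2, 1, 1856]]'
quoted   <- '[["picture8", 2, 2398], ["picture2", 1, 1856]]'
missing  <- "NA"

# validate() returns TRUE/FALSE with the parser message in attr(, "err");
# as.logical() drops that attribute so the results are plain logicals
checks <- vapply(
  list(unquoted = unquoted, quoted = quoted, missing = missing),
  function(txt) as.logical(validate(txt)),
  logical(1)
)
checks
```

Only the quoted version should pass; the other two are rejected by the parser before any reshaping can happen.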

So, for example, assuming your data looked like this:

df
#> # A tibble: 12 x 3
#>       ID    TP reaction_raw
#>    <int> <int> <chr>
#>  1     1     1 "[[\"picture8\", 2, 2398], [\"picture2\", 1, 1856], [\"picture1\~
#>  2     1     2 "[[\"picture8\", 1, 798], [\"picture2\", 2, 712], [\"picture1\",~
#>  3     1     3 <NA>
#>  4     1     4 "[[\"picture8\", 1, 1278], [\"picture2\", 1, 1712], [\"picture1\~
#>  5     2     1 "[[\"picture8\", 2, 2015], [\"picture2\", 1, 3820], [\"picture1\~
#>  6     2     2 "[[\"picture8\", 2, 3219], [\"picture2\", 2, 1920], [\"picture1\~
#>  7     2     3 <NA>
#>  8     2     4 <NA>
#>  9     3     1 "[[\"picture8\", 1, 209], [\"picture2\", 1, 382], [\"picture1\",~
#> 10     3     2 <NA>
#> 11     3     3 "[[\"picture8\", 2, 781], [\"picture2\", 1, 291], [\"picture1\",~
#> 12     3     4 <NA>

Then you could write a JSON string with null values in place of the missing measurements and insert it wherever reaction_raw is NA:

nulljson <- "[[\"picture8\", null, null], 
              [\"picture2\", null, null], 
              [\"picture1\", null, null]]"

df$reaction_raw[is.na(df$reaction_raw)] <- nulljson
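With roughly 80 trials per array, typing the placeholder by hand gets tedious. Here is a minimal sketch of building it from the first non-missing row instead. It assumes, as the hand-written placeholder does, that every row lists the same primes in the same order. The reaction_raw vector here is a small stand-in for df$reaction_raw.

```r
library(jsonlite)

# Stand-in for df$reaction_raw
reaction_raw <- c(
  '[["picture8", 2, 2398], ["picture2", 1, 1856], ["picture1", 1, 897]]',
  NA,
  '[["picture8", 1, 1278], ["picture2", 1, 1712], ["picture1", 1, 902]]'
)

# Take the prime names from the first row that is not missing
template <- fromJSON(reaction_raw[!is.na(reaction_raw)][1])
primes   <- template[, 1]

# One [prime, null, null] triple per trial, wrapped in an outer array
nulljson <- paste0(
  "[", paste(sprintf('["%s", null, null]', primes), collapse = ", "), "]"
)

# Fill the gaps with the generated placeholder
reaction_raw[is.na(reaction_raw)] <- nulljson
```

If the primes differ between rows, this shortcut would put the wrong labels into the placeholder, so it is worth checking that assumption on the real data first.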

Now you can reshape the JSON into separate columns...

# Flatten one JSON string into a vector: trial 1 values, then trial 2, ...
parser        <- function(x) as.vector(t(jsonlite::fromJSON(x)))
result_matrix <- do.call(rbind, lapply(df$reaction_raw, parser))
result_df     <- as.data.frame(result_matrix)
col_names     <- c("prime", "answer", "reaction_time")
# Number the columns by trial; each trial contributes 3 consecutive columns
col_names     <- paste0(col_names, rep(seq_len(ncol(result_df) / 3), each = 3))
result_df     <- setNames(result_df, col_names)
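An alternative that skips the placeholder string altogether is to let the parsing function return a row of NA values whenever the input is missing. This is only a sketch: parse_or_na and n_trials are hypothetical names, the number of trials is assumed to be known and constant, and unlike the placeholder approach it leaves the prime columns empty for missing rows as well.

```r
library(jsonlite)

# Stand-in for df$reaction_raw
reaction_raw <- c(
  '[["picture8", 2, 2398], ["picture2", 1, 1856], ["picture1", 1, 897]]',
  NA,
  '[["picture8", 2, 781], ["picture2", 1, 291], ["picture1", 1, 2039]]'
)

n_trials <- 3  # assumed known and identical for every row

# Hypothetical helper: flatten one JSON string, or return all-NA if missing
parse_or_na <- function(x, n_values) {
  if (is.na(x)) return(rep(NA_character_, n_values))
  as.vector(t(fromJSON(x)))
}

result_matrix <- do.call(
  rbind,
  lapply(reaction_raw, parse_or_na, n_values = 3 * n_trials)
)
result_df <- as.data.frame(result_matrix, stringsAsFactors = FALSE)

# prime1, answer1, reaction_time1, prime2, ...
names(result_df) <- paste0(
  c("prime", "answer", "reaction_time"),
  rep(seq_len(n_trials), each = 3)
)
```

Which variant fits better depends on whether the prime labels should be kept for measurements that were never taken.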

...and join them to your main data frame:

cbind(df[1:2], result_df)
#>    ID TP   prime1 answer1 reaction_time1   prime2 answer2 reaction_time2
#> 1   1  1 picture8       2           2398 picture2       1           1856
#> 2   1  2 picture8       1            798 picture2       2            712
#> 3   1  3 picture8    <NA>           <NA> picture2    <NA>           <NA>
#> 4   1  4 picture8       1           1278 picture2       1           1712
#> 5   2  1 picture8       2           2015 picture2       1           3820
#> 6   2  2 picture8       2           3219 picture2       2           1920
#> 7   2  3 picture8    <NA>           <NA> picture2    <NA>           <NA>
#> 8   2  4 picture8    <NA>           <NA> picture2    <NA>           <NA>
#> 9   3  1 picture8       1            209 picture2       1            382
#> 10  3  2 picture8    <NA>           <NA> picture2    <NA>           <NA>
#> 11  3  3 picture8       2            781 picture2       1            291
#> 12  3  4 picture8    <NA>           <NA> picture2    <NA>           <NA>
#>      prime3 answer3 reaction_time3
#> 1  picture1       1            897
#> 2  picture1       1            423
#> 3  picture1    <NA>           <NA>
#> 4  picture1       1            902
#> 5  picture1       2           2719
#> 6  picture1       1           1298
#> 7  picture1    <NA>           <NA>
#> 8  picture1    <NA>           <NA>
#> 9  picture1       2            891
#> 10 picture1    <NA>           <NA>
#> 11 picture1       1           2039
#> 12 picture1    <NA>           <NA>
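Because each array mixes strings and numbers, the parsed values typically come back as text, so the answer and reaction-time columns hold characters (or factors in R versions before 4.0) rather than numbers. Here is a sketch of converting them back, assuming the column-name prefixes used in this answer. The small result_df below is a stand-in for the real result.

```r
# Stand-in for the reshaped result, with all values stored as text
result_df <- data.frame(
  prime1         = c("picture8", "picture8"),
  answer1        = c("2", NA),
  reaction_time1 = c("2398", NA),
  stringsAsFactors = FALSE
)

# Pick the columns that should be numeric by their name prefix
numeric_cols <- grepl("^(answer|reaction_time)", names(result_df))

# as.character() first, so factor columns are converted by label, not by code
result_df[numeric_cols] <- lapply(
  result_df[numeric_cols],
  function(col) as.numeric(as.character(col))
)

str(result_df)
```

The NA entries stay NA after conversion, so means or other summaries of reaction times can be computed with na.rm = TRUE.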

Data used

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L), TP = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L
), reaction_raw = c(
"[[\"picture8\", 2, 2398], [\"picture2\", 1, 1856], [\"picture1\", 1, 897]]", 
"[[\"picture8\", 1, 798], [\"picture2\", 2, 712], [\"picture1\", 1, 423]]", 
NA, "[[\"picture8\", 1, 1278], [\"picture2\", 1, 1712], [\"picture1\", 1, 902]]", 
"[[\"picture8\", 2, 2015], [\"picture2\", 1, 3820], [\"picture1\", 2, 2719]]", 
"[[\"picture8\", 2, 3219], [\"picture2\", 2, 1920], [\"picture1\", 1, 1298]]", 
NA, NA, "[[\"picture8\", 1, 209], [\"picture2\", 1, 382], [\"picture1\", 2, 891]]", 
NA, "[[\"picture8\", 2, 781], [\"picture2\", 1, 291], [\"picture1\", 1, 2039]]", 
NA)), row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"
))

Created on 2020-07-07 by the reprex package (v0.3.0)
