Anubhav Dikshit
Anubhav Dikshit

Reputation: 1829

Read messy json in R

I have a csv file with the following structure:

Input

{"eid":"START","ver":"3.0","ets":1514764800238}}
{"eid":"INTERACT","ver":"3.0","ets":1514764820546}}
{"eid":"IMPRESSION","ver":"3.0","ets":895732}}
{"eid":"IMPRESSION","ver":"3.0","ets":245636}}
{"eid":"INTERACT","ver":"3.0","ets":535235423525}}

As you can see, its not a valid json, for the above to be valid json, the structure should be the following:

Expected Output

[{"eid":"START","ver":"3.0","ets":1514764800238},
{"eid":"INTERACT","ver":"3.0","ets":1514764820546},
{"eid":"IMPRESSION","ver":"3.0","ets":895732},
{"eid":"IMPRESSION","ver":"3.0","ets":245636},
{"eid":"INTERACT","ver":"3.0","ets":535235423525}]

Problem:

I would like to ideally read the file and fix it and save as a JSON, that is

  1. Replace "}}" with "}," everywhere except the last line
  2. Append "[" and "]" at the beginning and end of the file

I tried using fromJSON(rjson), read_delim, but I could not read it.

Thanks in advance

Upvotes: 0

Views: 101

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269586

Note that this question is nearly a duplicate of Extraction of different types of variables from a large list

Aside from reading it in and running fromJSON (jsonlite package) a single line of base code can convert it to valid JSON (in variable json).

  • replace "}}" with "}" using sub on each input line,
  • insert commas between lines using toString and
  • surround that with "[" and "]" using c

Code:

library(jsonlite)

L <- readLines("test.json")
json <- c("[", toString(sub("}}", "}", L)), "]")
fromJSON(json)

giving:

         eid ver          ets
1      START 3.0 1.514765e+12
2   INTERACT 3.0 1.514765e+12
3 IMPRESSION 3.0 8.957320e+05
4 IMPRESSION 3.0 2.456360e+05
5   INTERACT 3.0 5.352354e+11

Variation

This could alternately be expressed as a pipeline giving the same output:

library(jsonlite)
library(magrittr)

"test.json" %>%
  sub("}}", "}", .) %>%
  toString %>%
  c("[", ., "]") %>%
  fromJSON

Note

Test input is generated with the code below:

Lines <- c('{"eid":"START","ver":"3.0","ets":1514764800238}}',
'{"eid":"INTERACT","ver":"3.0","ets":1514764820546}}',
'{"eid":"IMPRESSION","ver":"3.0","ets":895732}}',
'{"eid":"IMPRESSION","ver":"3.0","ets":245636}}',
'{"eid":"INTERACT","ver":"3.0","ets":535235423525}}')

writeLines(Lines, "test.json")

Upvotes: 0

hrbrmstr
hrbrmstr

Reputation: 78792

Manual find/replace is a terrible, terrible, terrible suggestion for reproducible workflows.

One option — assuming there really is a }} at the end of each line and the file is in /tmp/badlines:

library(magrittr)
library(ndjson)

readLines("/tmp/badlines") %>%
  sub("\\}$", "", .) %>% 
  ndjson::flatten(cls = "tbl")
## # A tibble: 5 x 3
##   eid            ets ver  
##   <chr>        <dbl> <chr>
## 1 START      1.51e12 3.0  
## 2 INTERACT   1.51e12 3.0  
## 3 IMPRESSION 8.96e 5 3.0  
## 4 IMPRESSION 2.46e 5 3.0  
## 5 INTERACT   5.35e11 3.0  

Upvotes: 2

Related Questions