user8010356
user8010356

Reputation: 21

suggestion to read JSON inside a CSV file without parsing in R

I have a CSV file that also has JSON inside it. I am trying to get Company, Type, Driver into a dataframe I don't want to parse it as the below CSV is just a sample, I have a lot more columns with various json key/values (some are absent, not in a specific order, and lots and lots of them).

my sub-short_csvjson.csv CSV file sample:

Married,Transportation,Color
YES,"{""Company"":""GTS"",""Type"":""Limo""}",White
,"{""Driver"":""John""}",Green
NO,"{""Type"":""Van"",""Driver"":""John""}",

What can I do (outside of parsing) to have a dataframe with

my_data$Married
my_data$Transportation.Company
my_data$Transportation.Type
my_data$Transportation.Driver
my_data$Color

Thanks

Upvotes: 2

Views: 521

Answers (1)

Gopala
Gopala

Reputation: 10483

Here is one solution I can think of that gets you what you need using jsonlite package and row-by-row processing:

ASSUME df looks like this using read.csv and stringsAsFactors = FALSE:

df
  Married                  Transportation Color
1     YES {"Company":"GTS","Type":"Limo"} White
2                       {"Driver":"John"} Green
3      NO  {"Type":"Van","Driver":"John"}      

You can do this:

library(jsonlite)

l <- lapply(df$Transportation, fromJSON)
n <- unique(unlist(sapply(l, names)))
df[, n] <- lapply(n, function(x) sapply(l, function(y) y[[x]]))

To get this:

df
  Married                  Transportation Color Company Type Driver
1     YES {"Company":"GTS","Type":"Limo"} White     GTS Limo   NULL
2                       {"Driver":"John"} Green    NULL NULL   John
3      NO  {"Type":"Van","Driver":"John"}          NULL  Van   John

Not sure if there is a more efficient way.

EDIT BASED ON ADDED INFORMATION ABOUT MALFORMED JSON IN REAL DATA

In case there is malformed JSON in the original within the Transportation column, here is one way to solve it:

Original data frame as follows:

df <- read.table(text = 'Married,Transportation,Color
YES,"{""Company"":""GTS"",""Type"":""Limo""}",White
,"{""Driver"":""John""}",Green
NO,"{""Type"":""Van"",""Driver"":""John""}",',
header = TRUE, sep = ',', stringsAsFactors = FALSE)

Row bind and extra row with malformed JSON with an extra '"' character:

df <- rbind(df, data.frame(Married = 'NO',
                           Transportation = '{"Company": ""GTLS"}',
                           Color = 'Red'))

New df looks like this (See malformed JSON in row 4):

  Married                  Transportation Color
1     YES {"Company":"GTS","Type":"Limo"} White
2                       {"Driver":"John"} Green
3      NO  {"Type":"Van","Driver":"John"}      
4      NO            {"Company": ""GTLS"}   Red

Now, use this to get all the nested JSON into separate columns:

l <- lapply(df$Transportation, function(x) tryCatch({fromJSON(x)}, error = function(e) NA))
n <- unique(unlist(sapply(l, names)))
df[, n] <- lapply(n, function(x)
                    sapply(l, function(y)
                             if (!is.null(names(y))) y[[x]]))

Output is as follows:

  Married                  Transportation Color Company Type Driver
1     YES {"Company":"GTS","Type":"Limo"} White     GTS Limo   NULL
2                       {"Driver":"John"} Green    NULL NULL   John
3      NO  {"Type":"Van","Driver":"John"}          NULL  Van   John
4      NO            {"Company": ""GTLS"}   Red    NULL NULL   NULL

Upvotes: 2

Related Questions