Reputation: 21
I have a CSV file that also has JSON inside it. I am trying to get Company, Type, Driver into a dataframe I don't want to parse it as the below CSV is just a sample, I have a lot more columns with various json key/values (some are absent, not in a specific order, and lots and lots of them).
my sub-short_csvjson.csv CSV file sample:
Married,Transportation,Color
YES,"{""Company"":""GTS"",""Type"":""Limo""}",White
,"{""Driver"":""John""}",Green
NO,"{""Type"":""Van"",""Driver"":""John""}",
What can I do (outside of parsing) to have a dataframe with
my_data$Married
my_data$Transportation.Company
my_data$Transportation.Type
my_data$Transportation.Driver
my_data$Color
Thanks
Upvotes: 2
Views: 521
Reputation: 10483
Here is one solution I can think of that gets you what you need using jsonlite
package and row-by-row processing:
ASSUME df
looks like this using read.csv
and stringsAsFactors = FALSE
:
df
Married Transportation Color
1 YES {"Company":"GTS","Type":"Limo"} White
2 {"Driver":"John"} Green
3 NO {"Type":"Van","Driver":"John"}
You can do this:
library(jsonlite)
l <- lapply(df$Transportation, fromJSON)
n <- unique(unlist(sapply(l, names)))
df[, n] <- lapply(n, function(x) sapply(l, function(y) y[[x]]))
To get this:
df
Married Transportation Color Company Type Driver
1 YES {"Company":"GTS","Type":"Limo"} White GTS Limo NULL
2 {"Driver":"John"} Green NULL NULL John
3 NO {"Type":"Van","Driver":"John"} NULL Van John
Not sure if there is a more efficient way.
EDIT BASED ON ADDED INFORMATION ABOUT MALFORMED JSON IN REAL DATA
In case there is malformed JSON in the original within the Transportation
column, here is one way to solve it:
Original data frame as follows:
df <- read.table(text = 'Married,Transportation,Color
YES,"{""Company"":""GTS"",""Type"":""Limo""}",White
,"{""Driver"":""John""}",Green
NO,"{""Type"":""Van"",""Driver"":""John""}",',
header = TRUE, sep = ',', stringsAsFactors = FALSE)
Row bind and extra row with malformed JSON with an extra '"' character:
df <- rbind(df, data.frame(Married = 'NO',
Transportation = '{"Company": ""GTLS"}',
Color = 'Red'))
New df looks like this (See malformed JSON in row 4):
Married Transportation Color
1 YES {"Company":"GTS","Type":"Limo"} White
2 {"Driver":"John"} Green
3 NO {"Type":"Van","Driver":"John"}
4 NO {"Company": ""GTLS"} Red
Now, use this to get all the nested JSON into separate columns:
l <- lapply(df$Transportation, function(x) tryCatch({fromJSON(x)}, error = function(e) NA))
n <- unique(unlist(sapply(l, names)))
df[, n] <- lapply(n, function(x)
sapply(l, function(y)
if (!is.null(names(y))) y[[x]]))
Output is as follows:
Married Transportation Color Company Type Driver
1 YES {"Company":"GTS","Type":"Limo"} White GTS Limo NULL
2 {"Driver":"John"} Green NULL NULL John
3 NO {"Type":"Van","Driver":"John"} NULL Van John
4 NO {"Company": ""GTLS"} Red NULL NULL NULL
Upvotes: 2