nclaidiere
nclaidiere

Reputation: 35

Keep double quotes in R read_csv to read JSON data

I have a csv file imported from a sql database, one of the column is made of JSON data. I used to be able to read_csv the file and then parse the JSON data using fromJSON. However, now when I read the file the double quotations marks around the JSON fields are automatically removed and I can't find a way to prevent that from happening.

Here is an example. CSV file looks like this (3 columns, 4 rows):

A,B,C
1,2,"[{"C1":"C6.bmp","C2": "C5.bmp"}]"
3,4,"[{"C1":"C6.bmp","C2": "C5.bmp"}]"
5,6,"[{"C1":"C6.bmp","C2": "C5.bmp"}]"
7,8,"[{"C1":"C6.bmp","C2": "C5.bmp"}]"

json_data<-read_csv(file="test.csv") produces

> json_data
# A tibble: 4 × 3
  A     B C                       
 <dbl> <dbl> <chr>                   
  1     2 [{C1:C6.bmp,C2: C5.bmp}]
  3     4 [{C1:C6.bmp,C2: C5.bmp}]
  5     6 [{C1:C6.bmp,C2: C5.bmp}]
  7     8 [{C1:C6.bmp,C2: C5.bmp}]

instead of the expected:

> json_data
# A tibble: 4 × 3
  A     B C                       
 <dbl> <dbl> <chr>                   
  1     2 [{"C1":"C6.bmp","C2": "C5.bmp"}]
  3     4 [{"C1":"C6.bmp","C2": "C5.bmp"}]
  5     6 [{"C1":"C6.bmp","C2": "C5.bmp"}]
  7     8 [{"C1":"C6.bmp","C2": "C5.bmp"}]

Without quotation marks column C cannot be passed on to fromJSON...

Upvotes: 0

Views: 129

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 270120

1) sub Read in the csv file, replace the first and last double quote with single quotes and then read in the file specifying single quote as the quote character. Single quote can be replaced with any character that does not occur in the file.

"test.csv" |>
  readLines() |>
  sub('"(.*)"', "'\\1'", x = _) |>
  read.csv(text = _, quote = "'")
##   A B                                C
## 1 1 2 [{"C1":"C6.bmp","C2": "C5.bmp"}]
## 2 3 4 [{"C1":"C6.bmp","C2": "C5.bmp"}]
## 3 5 6 [{"C1":"C6.bmp","C2": "C5.bmp"}]
## 4 7 8 [{"C1":"C6.bmp","C2": "C5.bmp"}]

2) sed Alternately we could preprocess the file outside of R using sed and pipe that to read.csv.

Below we have assumed

  • Windows cmd line escaping
  • sed, which comes with Rtools, is on your Windows PATH
  • the first double quote is to be replaced with @
  • the double quote at the end of the line is to be replaced with @.

We used @ instead single quote to avoid some problems with its interpretation on the cmd line. Any character not in the file will do as long as it does not have a special interpretation by the cmd line processor.

If using Linux sed is included but the escaping may need to be modified slightly.

read.csv(pipe('sed -e "s/\\"/@/" -e "s/\\"$/@/" test.csv'), quote = "@")

3) read.fwf This only works if the fields are in the same positions in every line. This is the case in the question's example.

read.fwf("test.csv", c(1, 1, 1, 2, 32))[c(1, 3, 5)]

Note

Generate input test.csv:

Lines <- 'A,B,C
1,2,"[{"C1":"C6.bmp","C2": "C5.bmp"}]"
3,4,"[{"C1":"C6.bmp","C2": "C5.bmp"}]"
5,6,"[{"C1":"C6.bmp","C2": "C5.bmp"}]"
7,8,"[{"C1":"C6.bmp","C2": "C5.bmp"}]"'
writeLines(Lines, "test.csv")

Upvotes: 0

jay.sf
jay.sf

Reputation: 73592

Using readLines(), then gsub commas between word characters to something e.g. an asterisk and strsplit at it.

> s <- gsub(readLines("foo.csv"), pat="(?<=\\w),", rep="*", perl=TRUE) |> 
+   strsplit('\\*')
> do.call('rbind.data.frame', s[-1]) |> setNames(s[[1]])
  A B                                  C
1 1 2 "[{"C1":"C6.bmp","C2": "C5.bmp"}]"
2 3 4 "[{"C1":"C6.bmp","C2": "C5.bmp"}]"
3 5 6 "[{"C1":"C6.bmp","C2": "C5.bmp"}]"
4 7 8 "[{"C1":"C6.bmp","C2": "C5.bmp"}]"

Upvotes: 0

user2554330
user2554330

Reputation: 44977

If you can control how that file is created, you could get it to use single quotes around the JSON strings, i.e.

A,B,C
1,2,'[{"C1":"C6.bmp","C2": "C5.bmp"}]'
3,4,'[{"C1":"C6.bmp","C2": "C5.bmp"}]'
5,6,'[{"C1":"C6.bmp","C2": "C5.bmp"}]'
7,8,'[{"C1":"C6.bmp","C2": "C5.bmp"}]'

and then read.csv or read_csv will read it properly as long as you specify quote = "'":

> read.csv("~/temp/test.csv", quote="'")
  A B                                C
1 1 2 [{"C1":"C6.bmp","C2": "C5.bmp"}]
2 3 4 [{"C1":"C6.bmp","C2": "C5.bmp"}]
3 5 6 [{"C1":"C6.bmp","C2": "C5.bmp"}]
4 7 8 [{"C1":"C6.bmp","C2": "C5.bmp"}]

If you can't control that, you might be able to edit the file to look like that; in your sample, the only place "[ appears is at the start of the JSON, and the only place ]" appears is at the end. If this is also true for your real data, then you could make that edit pretty easily.

If that's not feasible, then you could read it with quote = "", which will leave all the quotes in place. The trouble with this is that the comma in the JSON will be taken to be a column separator, so you'll have a lot of fixups to do afterwards.

Upvotes: 0

Related Questions