Li haonan

Reputation: 630

R: using read.table to load data containing double quotes

My CSV data looks like this:

name,words,name
John, "He says:"I love it!"", 18

At first, I tried to load data with

data <- read.table("data.csv",header = T,sep = ',',quote = "",stringsAsFactors = FALSE)

And the error is:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1 did not have 3 elements

Well, I can understand that, since R gets confused by so many double quotes.

And I fixed it with

data <- read.table("data.csv",header = T,sep = ',',quote = "\"",stringsAsFactors = FALSE)

However, I can't figure out why this works. How does R know which double quote it should stop at?

Upvotes: 1

Views: 223

Answers (1)

Aaron - mostly inactive

Reputation: 37754

Well, that's an interesting data format -- and interesting behavior. The help page says "See scan for the behaviour on quotes embedded in quotes," but I didn't see anything useful in that help page, so I tried some things.

What I believe the quote argument does is tell R to ignore any sep characters that occur between a pair of quotes, and also to remove the quote characters themselves (because they're meant to be used only for delimiting fields, not as data). So this works for you only because you don't have any commas after the second quote in your words column.
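
To make those two effects concrete, here is a minimal sketch on a made-up, well-formed row. The sample data and the variable names are my own, not taken from your file.

```r
# Hypothetical sample: a header line and one row with a comma inside a quoted field.
lines <- c('name,words,age',
           'John,"He says, hi",18')

# quote = "\"": the comma inside the quotes is not a separator,
# and the quote characters are dropped from the value.
quoted <- read.table(text = lines, header = TRUE, sep = ",", quote = "\"",
                     stringsAsFactors = FALSE)
print(quoted$words)   # "He says, hi"

# quote = "": every comma splits, so the data row has one more field than the header.
con <- textConnection(lines)
n_fields <- count.fields(con, sep = ",", quote = "")
close(con)
print(n_fields)       # 3 4
```

The same comma is data in the first call and a separator in the second. That is the whole difference between your two read.table calls.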

Here are four examples.

no commas in the quotes (your example)

name,words,name
John, "He says:"I love it!"", 18

Interestingly, this example works for me in both versions of your code. The first leaves in all the quotes and the second removes them.

read.table("data.csv", header = TRUE, sep = ',', quote = "", stringsAsFactors = FALSE)
##   name                   words name.1
## 1 John  "He says:"I love it!""     18
read.table("data.csv", header = TRUE, sep = ',', quote = "\"", stringsAsFactors = FALSE)
##   name               words name.1
## 1 John  He says:I love it!     18

comma after the first quote

name,words,name
John, "He says, "I love it!"", 18

Here the first version (quote="") separates the row into four columns, not three, based on the commas, and uses the extra column as the row names. The second version ignores the added comma, but also removes the quotes around the actual quotation.

read.table("text.csv", header = TRUE, sep = ',', quote = "", stringsAsFactors = FALSE)
##           name          words name.1
## John  "He says  "I love it!""     18
read.table("text.csv", header = TRUE, sep = ',', quote = "\"", stringsAsFactors = FALSE)
##   name                words name.1
## 1 John  He says, I love it!     18

comma after the second quote

name,words,name
John, "He says: "I love it, do you?"", 18

Here both versions do almost the same thing (four columns) because the comma isn't between a pair of quotes. The first keeps the quotes, the second doesn't.

read.table("text.csv", header = TRUE, sep = ',', quote = "", stringsAsFactors = FALSE)
##                       name      words name.1
## John  "He says: "I love it  do you?""     18
read.table("text.csv", header = TRUE, sep = ',', quote = "\"", stringsAsFactors = FALSE)
##                     name    words name.1
## John  He says: I love it  do you?     18

commas in between both quotes

name,words,name
John, "He says, "I love it, do you?"", 18

Here the first one doesn't work, as it finds three column names but five columns in the first row. The second skips the first comma, but not the second, so again separates it into four columns, and uses the extra as the row name.

read.table("text.csv", header = TRUE, sep = ',', quote = "", stringsAsFactors = FALSE)
## Error in read.table("text.csv", header = TRUE, sep = ",", quote = "",  : 
##  more columns than column names
read.table("text.csv", header = TRUE, sep = ',', quote = "\"", stringsAsFactors = FALSE)
##                     name    words name.1
## John  He says, I love it  do you?     18
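
If you want to see how a single row gets cut up without going through read.table, scan() can be called on the row directly. This is only a sketch: it uses the row from the last example, and the variable names are mine.

```r
# The data row from the last example above.
row <- 'John, "He says, "I love it, do you?"", 18'

# quote = "": split on every comma; quote characters stay in the data.
raw_fields <- scan(text = row, what = "", sep = ",", quote = "", quiet = TRUE)

# quote = "\"": commas inside a quoted stretch are not treated as separators.
quoted_fields <- scan(text = row, what = "", sep = ",", quote = "\"", quiet = TRUE)

print(length(raw_fields))     # 5
print(length(quoted_fields))  # 4
print(quoted_fields)
```

The counts match what read.table reported above: five pieces with quote = "" and four with quote = "\"".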

Finally, all of these examples have only one data line; if you have more than one line and they parse into different numbers of columns, you'll get an error like the one you got, reported for the first line at which the number of columns differs.
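
One way to look for such lines before calling read.table is count.fields(), which reports the number of fields it finds on each line. Below is a rough sketch that puts the four example rows above into one hypothetical file. The variable names and the header-based comparison are my own choices.

```r
# The four example rows from above under one shared header (a made-up combined file).
lines <- c(
  'name,words,name',
  'John, "He says:"I love it!"", 18',
  'John, "He says, "I love it!"", 18',
  'John, "He says: "I love it, do you?"", 18',
  'John, "He says, "I love it, do you?"", 18'
)

# Count fields per line using the same sep and quote settings as the read.table call.
con <- textConnection(lines)
n_fields <- count.fields(con, sep = ",", quote = "\"")
close(con)
print(n_fields)

# Data lines (numbered from the first row after the header) whose
# field count differs from the header's.
bad_lines <- which(n_fields[-1] != n_fields[1])
print(bad_lines)
```

With a real file you would pass the file name instead of a text connection. Any line listed in bad_lines is a candidate for the "did not have N elements" error.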

What surprises me about your error is that it happens on line 1; you'd get this error if R thought you had fewer than three columns in that line (the number it found in the header row), but on my system, anyway, it finds three elements in that line.

Upvotes: 1
