Ambuj
Ambuj

Reputation: 11

Read comma-separated file which also have comma within a column

I have a file where the separator between columns is comma (,). However, comma may also occur within columns, i.e. the 'Notes' column:

Id,Notes,Other_ID
100,This text looks good,1000
101,This text,have,comma,2000

I tried to read the csv:

r <- read.csv("test.csv", sep = ",")

As a result, I received it as :

Id.Notes.GUID
100,This text is good,1000
102,This text,have,comma,2000

which is incorrect, as I would like to have the output as

Id       Notes                  GUID
100      This text is good      1000
102      This text,have,comma  2000

Goal is to receive the data intact with the columns irrespective of comma present within a column and shouldn't work like a delimiter.

Thanks in Advance

Upvotes: 1

Views: 1791

Answers (1)

G. Grothendieck
G. Grothendieck

Reputation: 269471

1) read.pattern read.pattern will read the fields according to the provided regular expression. For reproducibility we have used Lines in the Note below but if the data is in a file replace text=Lines with something like "myfile.csv" .

library(gsubfn)
read.pattern(text = Lines, pattern = "^(.*?),(.*),(.*)$", header = TRUE, as.is = TRUE)

giving:

   Id                Notes Other_ID
1 100 This text looks good     1000
2 101 This text,have,comma     2000

2) Base R Read the data into a character vector and replace the first and last comma on each line with some character that does not otherwise occur such as semicolon. Then read that.

L.raw <- readLines(textConnection(Lines))
L.semi <- sub(",(.*),", ";\\1;", L.raw)
read.table(text = L.semi, header = TRUE, sep = ";", as.is = TRUE)

giving:

   Id                Notes Other_ID
1 100 This text looks good     1000
2 101 This text,have,comma     2000

3) gawk If you have a very large input file then it would likely be faster to do as much as possible outside of R. Using gawk we have the following. (On Windows install Rtools if you don't already have gawk and also make sure it is on your path or else refer to it using the entire pathname.) In the BEGIN block first is the number of commas to replace before the field with commas and last is the number of commas to replace after the field with commas. In this case the field with commas is the second of 3 fields so first = last = 1.

# generate test input

Lines <- "Id,Notes,Other_ID
100,This text looks good, 1000
101,This text,have,comma,2000"

cat(Lines, file = "ambuj.dat")

# gawk program to replace commas
ambuj.awk <- '
BEGIN { first = 1; last = 1 }
{ 
  nc = gsub(/,/, ",") # number of commas
  for(i = nc; i > nc-last; i--) $0 = gensub(/,/, ";", i) # replace last last commas
  for(i = 0; i < first; i++) sub(/,/, ";") # replace first first commas 
  print
}'
cat(ambuj.awk, file = "ambuj.awk")

read.csv(pipe("gawk -f ambuj.awk ambuj.dat"), sep = ";", quote = "",
 comment.char = "")

Also you could set colClasses= to speed it up a bit more.

Note

Lines <- "Id,Notes,Other_ID
100,This text looks good, 1000
101,This text,have,comma,2000"

Upvotes: 3

Related Questions