Reputation: 11
I have a file where the separator between columns is comma (,
). However, comma may also occur within columns, i.e. the 'Notes' column:
Id,Notes,Other_ID
100,This text looks good,1000
101,This text,have,comma,2000
I tried to read the csv:
r <- read.csv("test.csv", sep = ",")
As a result, I received it as :
Id.Notes.GUID
100,This text is good,1000
102,This text,have,comma,2000
which is incorrect, as I would like to have the output as
Id Notes GUID
100 This text is good 1000
102 This text,have,comma 2000
Goal is to receive the data intact with the columns irrespective of comma present within a column and shouldn't work like a delimiter.
Thanks in Advance
Upvotes: 1
Views: 1791
Reputation: 269471
1) read.pattern read.pattern
will read the fields according to the provided regular expression. For reproducibility we have used Lines in the Note below but if the data is in a file replace text=Lines
with something like "myfile.csv"
.
library(gsubfn)
read.pattern(text = Lines, pattern = "^(.*?),(.*),(.*)$", header = TRUE, as.is = TRUE)
giving:
Id Notes Other_ID
1 100 This text looks good 1000
2 101 This text,have,comma 2000
2) Base R Read the data into a character vector and replace the first and last comma on each line with some character that does not otherwise occur such as semicolon. Then read that.
L.raw <- readLines(textConnection(Lines))
L.semi <- sub(",(.*),", ";\\1;", L.raw)
read.table(text = L.semi, header = TRUE, sep = ";", as.is = TRUE)
giving:
Id Notes Other_ID
1 100 This text looks good 1000
2 101 This text,have,comma 2000
3) gawk If you have a very large input file then it would likely be faster to do as much as possible outside of R. Using gawk
we have the following. (On Windows install Rtools if you don't already have gawk
and also make sure it is on your path or else refer to it using the entire pathname.) In the BEGIN
block first
is the number of commas to replace before the field with commas and last
is the number of commas to replace after the field with commas. In this case the field with commas is the second of 3 fields so first = last = 1.
# generate test input
Lines <- "Id,Notes,Other_ID
100,This text looks good, 1000
101,This text,have,comma,2000"
cat(Lines, file = "ambuj.dat")
# gawk program to replace commas
ambuj.awk <- '
BEGIN { first = 1; last = 1 }
{
nc = gsub(/,/, ",") # number of commas
for(i = nc; i > nc-last; i--) $0 = gensub(/,/, ";", i) # replace last last commas
for(i = 0; i < first; i++) sub(/,/, ";") # replace first first commas
print
}'
cat(ambuj.awk, file = "ambuj.awk")
read.csv(pipe("gawk -f ambuj.awk ambuj.dat"), sep = ";", quote = "",
comment.char = "")
Also you could set colClasses=
to speed it up a bit more.
Lines <- "Id,Notes,Other_ID
100,This text looks good, 1000
101,This text,have,comma,2000"
Upvotes: 3