Reputation: 20409
I have the CSV file like below:
data,key
"VA1,VA2,20140524,,0,0,5969,20140523134902,S7,S1147,140,20140523134902,m/t",4503632376496128
"VA2,VA3,20140711,,0,0,8824,20140601095714,S1,S6402,175,20140601095839,m/t",4503643113914368
I try to read it with R, but I don't need key
value and data
value should be read to separate columns. With the following code I get almost what I need:
data <- read.csv(fileCSV, header = FALSE, sep = ",", skip = 1, comment.char = "", quote = "")
I skip header line there (skip = 1
), say that I don't have it (header = FALSE
), and say that I don't have quotes (quote = ""
). But in result I get quote characters in V1 and V13 columns and extra V14 column:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 "VA1 VA2 20140524 NA 0 0 5969 2.014121e+13 S7 S1147 140 2.014121e+13 m/t" 4.503608e+15
Should I delete it somehow after reading csv? Or, is there any better way to read such csv files?
Upd. I use the following approach to delete quotes:
data[,"V1"] = sub("^\"", "", data[,"V1"])
data[,"V13"] = sub("\"$", "", data[,"V13"])
But factor
type is changed to character
for these columns.
Upvotes: 0
Views: 201
Reputation: 99331
How about a system command with fread()
?
writeLines(
'data,key
"VA1,VA2,20140524,,0,0,5969,20140523134902,S7,S1147,140,20140523134902,m/t",4503632376496128
"VA2,VA3,20140711,,0,0,8824,20140601095714,S1,S6402,175,20140601095839,m/t",4503643113914368', "x.txt"
)
require(bit64)
data.table::fread("cat x.txt | rev | cut -d '\"' -f2 | rev | tail -n +2")
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1: VA1 VA2 20140524 NA 0 0 5969 20140523134902 S7 S1147 140 20140523134902 m/t
# 2: VA2 VA3 20140711 NA 0 0 8824 20140601095714 S1 S6402 175 20140601095839 m/t
Here's a test on the two methods, as requested.
## 150k lines
writeLines(c("data,key\n", rep_len(
'"VA1,VA2,20140524,,0,0,5969,20140523134902,S7,S1147,140,20140523134902,m/t",4503632376496128\n', 1.5e5)),
"test.txt"
)
## fread() in well under 1 second (with bit64 loaded)
system.time({
dt <- data.table::fread(
"cat test.txt | rev | cut -d '\"' -f2 | rev | grep -e '^V'"
)
})
# user system elapsed
# 0.945 0.108 0.547
## your current read.csv() method in just over two seconds
system.time({
df <- read.csv("test.txt", header = FALSE, sep = ",", skip = 1, comment.char = "", quote = "")
df[,"V1"] = sub("^\"", "", df[,"V1"])
df[,"V13"] = sub("\"$", "", df[,"V13"])
})
# user system elapsed
# 2.134 0.000 2.129
dim(dt)
# [1] 150000 13
dim(df)
# [1] 150000 14
Upvotes: 2