Reputation: 347
I have a large dataset that contains both unmatched quotes and delimiters (semicolons) inside fields. Here is an example:
"a";"b";"c";"d"
"a";"b;c";"c";"d"
"a";"b"c";"c";"d"
I save this data as test_SO.txt and read it with read.csv as follows:
df <- read.csv("/Users/Al/Documents/test_SO.txt", header=F, quote = "", sep = ";", allowEscapes=T)
   V1    V2  V3  V4  V5
1 "a"   "b" "c" "d"
2 "a"    "b  c" "c" "d"
3 "a" "b"c" "c" "d"
df <- read.csv("/Users/Al/Documents/test_SO.txt", header=F, quote = "\"", sep = ";", allowEscapes=T)
I would like to read this data as follows:
  V1  V2 V3 V4
1  a   b  c  d
2  a b c  c  d
3  a b c  c  d
The problem is that when I escape the quotes I cannot escape the delimiters inside fields, and vice versa.
I have tried the solution "readLines, replace delimiter and read", but my data is too large and the function too slow. Is there a way to do that within read.csv itself?
Upvotes: 1
Views: 92
Reputation: 20811
If you don't mind a few steps, you can clean up the text with gsub before parsing it:
(t1 <- gsub('\";\"', '|', '"a";"b";"c";"d"
"a";"b;c";"c";"d"
"a";"b"c";"c";"d"'))
# [1] "\"a|b|c|d\"\n\"a|b;c|c|d\"\n\"a|b\"c|c|d\""
(t2 <- gsub('\"', '', t1))
# [1] "a|b|c|d\na|b;c|c|d\na|bc|c|d"
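# put a space between two word characters separated by an optional ';'
# (note: this also splits any longer run of word characters, so it only
# suits single-character pieces like the ones in this example)
(t3 <- gsub('(\\w);?(\\w)', '\\1 \\2', t2))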
# [1] "a|b|c|d\na|b c|c|d\na|b c|c|d"
read.table(text = t3, sep = '|')
#   V1  V2 V3 V4
# 1  a   b  c  d
# 2  a b c  c  d
# 3  a b c  c  d
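For fields longer than one character, the same idea can be sketched per line rather than on one big string. This is only a rough variant under some assumptions: every field is wrapped in double quotes, the real separator is always the three-character sequence `";"`, and `|` never occurs in the data. The helper name `clean_quoted_fields` is made up for illustration. It still expects a character vector of lines (for example from readLines), so it may not address the speed concern on a very large file.

```r
# Hypothetical helper (not part of the original answer).
# Assumes: all fields are double-quoted, fields are separated by `";"`,
# and `|` does not appear anywhere in the data.
clean_quoted_fields <- function(lines) {
  # Drop the quote that opens and the quote that closes each line.
  x <- sub('^"', '', sub('"$', '', lines))
  # The sequence `";"` marks a real field boundary: turn it into `|`.
  x <- gsub('";"', '|', x, fixed = TRUE)
  # Any quote or semicolon still left sits inside a field: make it a space.
  x <- gsub('[";]', ' ', x)
  read.table(text = x, sep = '|', quote = '', stringsAsFactors = FALSE)
}

lines <- c('"a";"b";"c";"d"',
           '"a";"b;c";"c";"d"',
           '"a";"b"c";"c";"d"')
df <- clean_quoted_fields(lines)
print(df)
```

Because the field boundary is matched as a fixed string, longer values such as `"foo;bar"` stay in one column instead of being split on every character.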
Upvotes: 1