Reputation: 7928
I have a very large csv files that has the following structure
123, NAME1, [email protected]
111, NAME2, [email protected]
The problem is that some names have a comma, something like
699, FIRST M. LAST, Jr., [email protected]
Is there a way to solve this? The original csv has aprox 80k entries, so doing it by hand is not possible.
Thanks!
Upvotes: 2
Views: 1287
Reputation: 89057
Here is an R solution using a regular expression:
file <- textConnection("123, NAME1, [email protected]
111, NAME2, [email protected]
699, FIRST M. LAST, Jr., [email protected]")
lines <- readLines(file)
pattern <- "^(\\d+), (.*), \\b(.*)$"
matches <- regexec(pattern, lines)
bad.rows <- which(sapply(matches, length) == 1L)
if (length(bad.rows) > 0L) stop(paste("bad row: ", lines[bad.rows]))
data <- regmatches(lines, matches)
as.data.frame(matrix(unlist(data), ncol = 4L, byrow = TRUE)[, -1L])
# V1 V2 V3
# 1 123 NAME1 [email protected]
# 2 111 NAME2 [email protected]
# 3 699 FIRST M. LAST, Jr. [email protected]
Upvotes: 1
Reputation: 21492
This is a commonly asked question, and one of the better answers is to use scan
or readLines
to load the whole mess into R
and then make use of gsub
or other regex tools to split the lines into the desired elements.
EDIT: see flodel's answer for details on this approach
Upvotes: 0
Reputation: 7928
I use this simple python script to convert my data
import sys
for line in open(sys.argv[1]):
x = line.split(',')
x = [token.strip() for token in x]
x = [x[0], '"%s"' % (",".join(x[1:-1])), x[-1]]
print ";".join(x)
To run it
python conv.py input.txt > output.txt
After that, I can read it in R without a problem.
Thanks!
Upvotes: 0
Reputation: 121568
In 2 steps , you can do this for example:
## read using `fill=TRUE`
dat <- read.table(text='
123, NAME1, [email protected]
111, NAME2, [email protected]
699, FIRST M. LAST, Jr., [email protected]',sep=',',
fill=TRUE,
header=FALSE,stringsAsFactors=FALSE)
## concatenate names when they contain a comma
dat$V3 <- ifelse(nchar(dat$V4)>0,paste(dat$V3,dat$V4,sep=','),dat$V3)
dat[,-4]
V1 V2 V3
1 123 NAME1 [email protected]
2 111 NAME2 [email protected]
3 699 FIRST M. LAST Jr., [email protected]
Upvotes: 2