Ignacio
Ignacio

Reputation: 7928

extra commas in csv causing problems

I have a very large csv files that has the following structure

123, NAME1, [email protected]
111, NAME2, [email protected]

The problem is that some names have a comma, something like

699, FIRST M. LAST, Jr., [email protected]

Is there a way to solve this? The original csv has aprox 80k entries, so doing it by hand is not possible.

Thanks!

Upvotes: 2

Views: 1287

Answers (4)

flodel
flodel

Reputation: 89057

Here is an R solution using a regular expression:

file <- textConnection("123, NAME1, [email protected]
111, NAME2, [email protected]
699, FIRST M. LAST, Jr., [email protected]")

lines   <- readLines(file)
pattern <- "^(\\d+), (.*), \\b(.*)$"
matches <- regexec(pattern, lines)

bad.rows <- which(sapply(matches, length) == 1L)
if (length(bad.rows) > 0L) stop(paste("bad row: ", lines[bad.rows]))

data <- regmatches(lines, matches)
as.data.frame(matrix(unlist(data), ncol = 4L, byrow = TRUE)[, -1L])

#    V1                 V2                 V3
# 1 123              NAME1 [email protected]
# 2 111              NAME2   [email protected]
# 3 699 FIRST M. LAST, Jr. [email protected]

Upvotes: 1

Carl Witthoft
Carl Witthoft

Reputation: 21492

This is a commonly asked question, and one of the better answers is to use scan or readLines to load the whole mess into R and then make use of gsub or other regex tools to split the lines into the desired elements.

EDIT: see flodel's answer for details on this approach

Upvotes: 0

Ignacio
Ignacio

Reputation: 7928

I use this simple python script to convert my data

import sys

for line in open(sys.argv[1]):
    x = line.split(',')
    x = [token.strip() for token in x]
    x = [x[0], '"%s"' % (",".join(x[1:-1])), x[-1]]
    print ";".join(x)

To run it

python conv.py input.txt > output.txt

After that, I can read it in R without a problem.

Thanks!

Upvotes: 0

agstudy
agstudy

Reputation: 121568

In 2 steps , you can do this for example:

## read using `fill=TRUE`
dat <- read.table(text='
123, NAME1, [email protected]
111, NAME2, [email protected]
699, FIRST M. LAST, Jr., [email protected]',sep=',',
                  fill=TRUE,
                  header=FALSE,stringsAsFactors=FALSE)
## concatenate names when they contain a comma
dat$V3 <- ifelse(nchar(dat$V4)>0,paste(dat$V3,dat$V4,sep=','),dat$V3)
dat[,-4]
   V1             V2                       V3
1 123          NAME1       [email protected]
2 111          NAME2         [email protected]
3 699  FIRST M. LAST  Jr., [email protected]

Upvotes: 2

Related Questions