Reputation: 33
I have a CSV file with millions of records like the ones below
1,,,,,,,,,,a,,,,,,,,,,,,,,,,4,,,,,,,,,,,,,,,456,,,,,,,,,,,,,,,,,,,,,3455,,,,,,,,,,
1,,,,,,,,,,b,,,,,,,,,,,,,,,,5,,,,,,,,,,,,,,,467,,,,,,,,,,,,,,,,,,,,,3445,,,,,,,,,,
2,,,,,,,,,,c,,,,,,,,,,,,,,,,6,,,,,,,,,,,,,,,567,,,,,,,,,,,,,,,,,,,,,4656,,,,,,,,,,
I have to remove the extra commas between two values and keep only one. The output for the sample input should look like
1,a,4,456,3455
1,b,5,467,3445
2,c,6,567,4656
How can I achieve this using the shell, since that would automate it for the other files too? I need to load this data into a database. Can we do it using R?
Upvotes: 3
Views: 2338
Reputation: 8406
Use tr -s:
echo 'a,,,,,,,,b,,,,,,,,,,c' | tr -s ','
Output:
a,b,c
If the input line has trailing commas, tr -s ',' would squeeze those trailing commas into one comma, but getting rid of that last one requires adding a little sed code: tr -s ',' | sed 's/,$//'.
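As a minimal end-to-end sketch (the file name input.csv is an assumption, not from the question):

```shell
# Create a small sample shaped like the question's data (hypothetical file name)
printf '1,,,a,,,4,,\n2,,,b,,,5,,\n' > input.csv

# Squeeze runs of commas to a single comma, then drop the trailing comma
tr -s ',' < input.csv | sed 's/,$//'
# prints:
# 1,a,4
# 2,b,5
```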
Speed. Tests on a 10,000,000-line test file consisting of the first line of the OP's example, repeated:
tr -s ','  (but leaves the trailing comma)
tr -s ',' | sed 's/,$//'
sed -e "s/,\+/,/g" -e "s/,$//"  (Jean-François Fabre's answer)
Upvotes: 4
Reputation: 37641
Edited to address modified question.
R solution.
The original solution provided was just processing text. Assuming that your rows are stored in a character vector, you can handle multiple rows with:
# Create Data
Row1 = "1,,,,,,,a,,,,,,,,,,4,,,,,,,,,456,,,,,,,,,,,3455,,,,,,,"
Row2 = "2,,,,,,,b,,,,,,,,,,5,,,,,,,,,567,,,,,,,,,,,4566,,,,,,,"
Rows = c(Row1, Row2)
CleanedRows = gsub(",+", ",", Rows)          # Compress multiple commas
CleanedRows = sub(",\\s*$", "", CleanedRows) # Remove final comma, if any
CleanedRows
[1] "1,a,4,456,3455" "2,b,5,567,4566"
But if you are trying to read this from a CSV file and compress the rows:
## Create sample data
Data = read.csv(text="1,,,,,,,a,,,,,,,,,,4,,,,,,,,,456,,,,,,,,,,,3455,,,,,,,
2,,,,,,,b,,,,,,,,,,5,,,,,,,,,567,,,,,,,,,,,4566,,,,,,,",
header=FALSE)
Your code would probably say:
Data = read.csv("YourFile.csv", header=FALSE)
Data = Data[which(!is.na(Data[1,]))]
Data
V1 V8 V18 V27 V38
1 1 a 4 456 3455
2 2 b 5 567 4566
Note: This assumes that the non-blank fields are in the same place in every row.
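From the shell, the same fixed-position assumption lets cut select the populated fields directly; in this sketch the sample rows and the field numbers 1, 3 and 5 are illustrative, not the question's actual layout:

```shell
# Only fields 1, 3 and 5 are populated, so select just those
printf '1,,a,,10\n2,,b,,20\n' | cut -d, -f1,3,5
# prints:
# 1,a,10
# 2,b,20
```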
Upvotes: 4
Reputation: 132676
Can we do it using R?
Provided your input is as shown, i.e., you want to skip the same columns in all rows, you can analyze the first line and then define column classes in read.table:
text <- "1,,,,,,,,,,a,,,,,,,,,,,,,,,,4,,,,,,,,,,,,,,,456,,,,,,,,,,,,,,,,,,,,,3455,,,,,,,,,,
1,,,,,,,,,,b,,,,,,,,,,,,,,,,5,,,,,,,,,,,,,,,467,,,,,,,,,,,,,,,,,,,,,3445,,,,,,,,,,
2,,,,,,,,,,c,,,,,,,,,,,,,,,,6,,,,,,,,,,,,,,,567,,,,,,,,,,,,,,,,,,,,,4656,,,,,,,,,,"
tmp <- read.table(text = text, nrows = 1, sep = ",")
colClasses <- sapply(tmp, class)
colClasses[is.na(unlist(tmp))] <- "NULL"
Here I assume there are no actual NA values in the first line. If there could be, you'd need to adjust it slightly.
read.table(text = text, sep = ",", colClasses = colClasses)
# V1 V11 V27 V42 V63
#1 1 a 4 456 3455
#2 1 b 5 467 3445
#3 2 c 6 567 4656
Obviously, you'd specify a file instead of text.
This solution is fairly efficient for smallish to moderately sized data. For large data, substitute the second read.table with fread from package data.table (but that applies regardless of the column-skipping problem).
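The same idea, analyzing the first line to find the non-blank columns, can be cross-checked from the shell with awk (a sketch; the sample row here is abbreviated from the question's):

```shell
# Print the 1-based index of every non-empty field in the line
printf '1,,,,a,,,4,,\n' |
  awk -F, '{ for (i = 1; i <= NF; i++) if ($i != "") print i }'
# prints 1, 5 and 8, one per line
```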
Upvotes: 0
Reputation:
If you have a file that's really a CSV file, it might have quoting of commas in a few different ways, which can make regex-based CSV parsing unhappy.
I generally use and recommend csvkit, which has a nice set of CSV parsing utilities for the shell. Docs at http://csvkit.readthedocs.io/en/latest/
Your exact issue is answered in csvkit with this set of commands. First, csvstat shows what the file looks like:
$ csvstat -H --max tmp.csv | grep -v None
1. column1: 2
11. column11: c
27. column27: 6
42. column42: 567
63. column63: 4656
Then, now that you know that all of the data is in those columns, you can run this:
$ csvcut -c 1,11,27,42,63 tmp.csv
1,a,4,456,3455
1,b,5,467,3445
2,c,6,567,4656
to get your desired answer.
Upvotes: 0
Reputation: 140168
sed method:
sed -e "s/,\+/,/g" -e "s/,$//" input_file > output_file
Turns multiple commas into a single comma and also removes the trailing comma on each line.
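A quick check of that command on data shaped like the question's (note that the \+ repetition escape is a GNU sed extension; strictly POSIX sed would need ,,* instead):

```shell
# Collapse runs of commas, then strip the trailing comma (GNU sed)
printf '9,,,,x,,,,77,,\n' | sed -e 's/,\+/,/g' -e 's/,$//'
# prints: 9,x,77
```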
Upvotes: 4