lcm

Reputation: 1767

Ignoring lines when combining CSV files using cat or tail?

I have been using cat to combine a bunch of CSV files, then sort to delete the duplicates, before importing into MySQL. As dirty as working with tons of CSV files can be, it's been smooth sailing until I found that some of my data is not making it into the database.

What I found was that when the files are combined, in some cases the column names (the first line of each CSV file) get pushed into additional columns. As a result, when I use sort to remove duplicates and write a new file, the output stops at the first line where the column names were pushed into new, unused columns.

I'm using cat like the following:

  cat *.csv >combined.csv

and sort to remove the duplicates:

  sort -u combined.csv -o cleaned.csv

I then tried tail, which looks promising:

tail -n+2 *.csv >combined.csv

However, the way I am using tail, I get the file name in a row right before the contents of each CSV file that was combined:

==> first-file.csv <==
red     | 1234
yellow  | 5678 
blue    | 9123
green   | 4567
orange  | 8901
black   | 2345
white   | 6789
==> second-file.csv <==
brown   | 1234
gray    | 5678 
tan     | 9123
burgundy| 4567

Instead of:

red     | 1234
yellow  | 5678 
blue    | 9123
green   | 4567
orange  | 8901
black   | 2345
white   | 6789
brown   | 1234
gray    | 5678 
tan     | 9123
burgundy| 4567

Any help would be most appreciated here. I am going to have to go through all these files again so I need to get it right this time.

Let me know if clarification is needed. I am running on a Mac, with production on Linux, and would ideally like to accomplish this with cat, sort, tail or similar.

EDIT: To recreate the problem with generic data, save the following data as CSV in two separate files. I named them test-1.csv and test-2.csv.

color, votes, trend
"red", "1234", "1,3,3,4"
"yellow", "5678", "2,3,3,4"
"blue", "9123", "2,3,3,4"
"green", "4567", "5,3,3,4"
"orange", "8901", "2,2,3,4"
"black", "2345", "2,1,3,4"
"white", "6789", "2,5,3,4"
"brown", "1234", "2,7,3,4"
"gray", "5678", "8,2,3,4"
"tan", "9123", "9,3,3,4"
"burgundy", "4567", "2,5,1,4" 

Then just run:

tail -q -n +2 *.csv > combined.csv

Upvotes: 1

Views: 637

Answers (1)

Rubens

Reputation: 14778

By default, whenever there is more than one input file, tail prints a header with the file name before each file's output. To suppress these headers, use -q:

-q, --quiet, --silent
    never output headers giving file names

Your command line should go as follows:

tail -q -n +2 *.csv > combined.csv
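
If you also want to keep a single header row and fold the sort -u step into the same pass, something along these lines might work. It is only a sketch: the sample data, the temporary directory and the output name combined.out are made up for the demo. The output name deliberately does not end in .csv, so a second run does not pick up its own result through the *.csv glob.

```shell
#!/bin/sh
# Sketch: merge CSV files, keep one header row, drop duplicate data rows.
# The sample files and the output name are hypothetical, for illustration only.
set -eu

# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir"

printf 'color,votes\nred,1234\nblue,9123\n' > test-1.csv
printf 'color,votes\nblue,9123\ntan,4567\n' > test-2.csv

# Output name chosen so that it is NOT matched by *.csv on a rerun.
out=combined.out

# Put the list of input files into the positional parameters.
set -- *.csv

# Copy the header from the first file only.
head -n 1 "$1" > "$out"

# Data rows from every file: -q suppresses the "==> name <==" headers,
# -n +2 starts at line 2; sort -u then removes duplicate rows.
tail -q -n +2 "$@" | sort -u >> "$out"

cat "$out"
```

Note that tail copies each file's bytes as they are, so if a file does not end with a newline its last row is joined to the first row of the next file. If that turns out to be the case for your data, awk 'FNR > 1' *.csv is a common alternative, since awk terminates every line it prints.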

Upvotes: 1
