Nick Lashinsky
Nick Lashinsky

Reputation: 119

How to merge multiple CSV files into one, with only unique values

I have multiple CSV files in a directory, all with the same data model, some of which are duplicate entires.

file1.csv:

1       joe      red
2       bill     blue 
3       bob      green

file2.csv:

3       bob      green
4       mary     white 
5       jim      yellow

file3.csv:

5       jim      yellow
6       lauren    pink  
7       george    purple

My goal is to produce a single CSV file of all the unique values.

So far the code I have is this:

  1. Concatenated all the files in iTerm2 using

     cat *.csv > combined.csv
    
  2. And this script:

    require 'csv'
    
    File.open("all_unique_rows.csv", "w+") { 
    
      |file| file.puts File.readlines("combined.csv").uniq 
    
    }
    

However, I want to be able to do this all from a single Ruby script, however I'm just not sure how to make the "combined.csv" file one giant file using Ruby.

Upvotes: 0

Views: 4592

Answers (1)

the Tin Man
the Tin Man

Reputation: 160621

I wouldn't do this in Ruby if your records are true duplicates. Instead, take advantage of existing tools in the OS made for this:

cat *.csv | sort -u >unique.csv

When finished, "unique.csv" will contain the unique records.

If you insist in writing it in Ruby, then take advantage of built-in methods or classes. Here is one untested way to do it:

require 'set'
unique = Set.new
Dir.glob('*.csv') do |f|
  File.foreach(f) { |l| unique << l }
end
File.write('unique.csv', unique.sort.join)

This works to create a unique output because Sets don't allow duplicates.

An alternate way is to do something like:

unique = []
Dir.glob('*.csv') do |f|
  unique += File.readlines(f)
end
File.write('unique.csv', unique.sort.uniq.join)

While Ruby can do this, using the OS to handle it is much more scalable. YMMV.


I tried running cat *.csv | sort - u > unique.csv in OS, but ended up with some values misplaced into the wrong columns.

I created the three files on my disk:

$ cat file1.csv
1       joe      red
2       bill     blue
3       bob      green
$ cat file2.csv
3       bob      green
4       mary     white
5       jim      yellow
$ cat file3.csv
5       jim      yellow
6       lauren    pink
7       george    purple

Running cat *.csv | sort -u >unique.csv and looking at the resulting file shows:

$ cat unique.csv
1       joe      red
2       bill     blue
3       bob      green
4       mary     white
5       jim      yellow
6       lauren    pink
7       george    purple

The duplicates are removed, and the file is identical to the input samples you gave. Your "file3.csv" shows an extra space in the last to lines, pushing the right-most column over.

Note: Your files are NOT CSV files. CSV stands for "comma-separated values" and there are no commas between your columns. It is possible you originally had TSV ("tab-separated values"), which Ruby's CSV class can read and write, or you had fixed-width columns and somehow added an extra space. It's really important to use the right terminology and do it consistently, especially when asking questions.

Upvotes: 5

Related Questions