Reputation: 119
I have multiple CSV files in a directory, all with the same data model, some of which are duplicate entires.
file1.csv:
1 joe red
2 bill blue
3 bob green
file2.csv:
3 bob green
4 mary white
5 jim yellow
file3.csv:
5 jim yellow
6 lauren pink
7 george purple
My goal is to produce a single CSV file of all the unique values.
So far the code I have is this:
Concatenated all the files in iTerm2 using
cat *.csv > combined.csv
And this script:
require 'csv'
File.open("all_unique_rows.csv", "w+") {
|file| file.puts File.readlines("combined.csv").uniq
}
However, I want to be able to do this all from a single Ruby script, however I'm just not sure how to make the "combined.csv" file one giant file using Ruby.
Upvotes: 0
Views: 4592
Reputation: 160621
I wouldn't do this in Ruby if your records are true duplicates. Instead, take advantage of existing tools in the OS made for this:
cat *.csv | sort -u >unique.csv
When finished, "unique.csv" will contain the unique records.
If you insist in writing it in Ruby, then take advantage of built-in methods or classes. Here is one untested way to do it:
require 'set'
unique = Set.new
Dir.glob('*.csv') do |f|
File.foreach(f) { |l| unique << l }
end
File.write('unique.csv', unique.sort.join)
This works to create a unique output because Sets don't allow duplicates.
An alternate way is to do something like:
unique = []
Dir.glob('*.csv') do |f|
unique += File.readlines(f)
end
File.write('unique.csv', unique.sort.uniq.join)
While Ruby can do this, using the OS to handle it is much more scalable. YMMV.
I tried running cat *.csv | sort - u > unique.csv in OS, but ended up with some values misplaced into the wrong columns.
I created the three files on my disk:
$ cat file1.csv 1 joe red 2 bill blue 3 bob green
$ cat file2.csv 3 bob green 4 mary white 5 jim yellow
$ cat file3.csv 5 jim yellow 6 lauren pink 7 george purple
Running cat *.csv | sort -u >unique.csv
and looking at the resulting file shows:
$ cat unique.csv 1 joe red 2 bill blue 3 bob green 4 mary white 5 jim yellow 6 lauren pink 7 george purple
The duplicates are removed, and the file is identical to the input samples you gave. Your "file3.csv" shows an extra space in the last to lines, pushing the right-most column over.
Note: Your files are NOT CSV files. CSV stands for "comma-separated values" and there are no commas between your columns. It is possible you originally had TSV ("tab-separated values"), which Ruby's CSV class can read and write, or you had fixed-width columns and somehow added an extra space. It's really important to use the right terminology and do it consistently, especially when asking questions.
Upvotes: 5