Tanya

Reputation: 27

Comparing two large unsorted CSV files based on two columns

I am tasked with comparing two large, unsorted .csv files based on columns 1 and 3. Each file contains about 200k records. For the output, I need to know which records, based on columns 1 and 3, exist in the first file but not the second. The files are quoted comma-separated value files. Column 3 should be compared case-insensitively.

Example File 1:

"id", "name", "email", "country"
"1233",  "jake", "[email protected]", "USA"
"2345", "alison", "[email protected]", "Canada"
"3456", "jacob", "[email protected]", "USA"
"5678", "natalia", "[email protected]", "USA"

File 2:

"id", "name", "email", "country"
"2345", "alison", "[email protected]", "Canada"
"3456", "jacob", "[email protected]", "USA"
"5690", "lina", "[email protected]", "Canada" 

Desired output file:

"5678", "natalia", "[email protected]", "USA"

Code examples would be very much appreciated.

Upvotes: 1

Views: 2393

Answers (5)

Newbiee

Reputation: 187

awk 'BEGIN { FS=OFS=","}; NR==FNR{a[tolower($1$3)]=++i;next} { if ( tolower($1$3) in a);else {print } }' file2 file1

Output:

"1233",  "jake", "[email protected]", "USA"
"5678", "natalia", "[email protected]", "USA"

Upvotes: 0

Omnaest

Reputation: 3096

Load the file content into an in-memory database, e.g. H2, and use a SQL SELECT with a join.
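This answer stops at the idea; a minimal sketch of the same approach, substituting Python's built-in sqlite3 in-memory database for H2 (table and column names are my own, and the sample data mirrors the question, where the emails appear redacted), might look like:

```python
import csv
import io
import sqlite3

# Sketch of the in-memory-database idea, using stdlib sqlite3 in place
# of H2. Sample data mirrors the question's (redacted) example files.
FILE1 = '''\
"id", "name", "email", "country"
"1233", "jake", "[email protected]", "USA"
"2345", "alison", "[email protected]", "Canada"
"3456", "jacob", "[email protected]", "USA"
"5678", "natalia", "[email protected]", "USA"
'''
FILE2 = '''\
"id", "name", "email", "country"
"2345", "alison", "[email protected]", "Canada"
"3456", "jacob", "[email protected]", "USA"
"5690", "lina", "[email protected]", "Canada"
'''

con = sqlite3.connect(":memory:")
for table, text in (("f1", FILE1), ("f2", FILE2)):
    con.execute(f"CREATE TABLE {table} (id TEXT, name TEXT, email TEXT, country TEXT)")
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    con.executemany(f"INSERT INTO {table} VALUES (?,?,?,?)", rows[1:])  # skip header

# Anti-join on column 1 (id) and case-insensitive column 3 (email):
# keep the f1 rows that have no matching f2 row.
missing = con.execute("""
    SELECT f1.id, f1.name, f1.email, f1.country FROM f1
    LEFT JOIN f2 ON f1.id = f2.id AND LOWER(f1.email) = LOWER(f2.email)
    WHERE f2.id IS NULL
""").fetchall()
for row in missing:
    print(row)
```

With real 200k-row files you would load them with the csv module instead of the embedded strings; an index on (id, email) in each table keeps the join fast.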

Upvotes: 0

Karoly Horvath

Reputation: 96258

awk 'BEGIN { FS="\", \""}
     FNR == 1 {read++;}
     FNR !=1 {if (read==1) {store[$1","tolower($3)] = $0} if (read==2) {delete store[$1","tolower($3)]}}
     END {for (i in store) {print store[i]}}' file1 file2

Output:

"1233",  "jake", "[email protected]", "USA"
"5678", "natalia", "[email protected]", "USA"

Upvotes: 0

dogbane

Reputation: 274592

Try:

join -v 1 -i -t, -1 1 -2 1 -o 1.2,1.3,1.4,1.5 <(awk -F, '{print $1":"$3","$0}' f1.txt | sort) <(awk -F, '{print $1":"$3","$0}' f2.txt | sort)

How it works:

1) I first create a composite key column by joining column 1 and column 3:

awk -F, '{print $1":"$3","$0}' f1.txt
awk -F, '{print $1":"$3","$0}' f2.txt

2) I sort both outputs:

awk -F, '{print $1":"$3","$0}' f1.txt | sort 
awk -F, '{print $1":"$3","$0}' f2.txt | sort 

3) I then use the join command to join on the first column (the composite key) and output the unpairable lines coming from file 1 (-v 1), ignoring case (-i).

Output:

"1233",  "jake", "[email protected]", "USA"
"5678", "natalia", "[email protected]", "USA"

Upvotes: 1

atrain

Reputation: 9255

Loop over both files and load each into a hash keyed on columns 1 and 3 (with column 3 lower-cased). Then, for each key from the first file that is absent from the second file's hash, output that record as missing. I would use Perl for this task.
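This answer is pseudocode (and suggests Perl); a minimal sketch of the same hash-lookup idea, shown here in Python with the question's (redacted) sample data embedded as strings, would be:

```python
import csv
import io

# Sketch of the hash-lookup approach: build a set of composite keys
# from file 2, then emit file 1 rows whose key is not in that set.
FILE1 = '''\
"id", "name", "email", "country"
"1233", "jake", "[email protected]", "USA"
"2345", "alison", "[email protected]", "Canada"
"3456", "jacob", "[email protected]", "USA"
"5678", "natalia", "[email protected]", "USA"
'''
FILE2 = '''\
"id", "name", "email", "country"
"2345", "alison", "[email protected]", "Canada"
"3456", "jacob", "[email protected]", "USA"
"5690", "lina", "[email protected]", "Canada"
'''

def records(text):
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(rows)  # skip the header row
    return list(rows)

def key(row):
    # Composite key: column 1 as-is, column 3 case-folded.
    return (row[0], row[2].lower())

seen = {key(r) for r in records(FILE2)}
missing = [r for r in records(FILE1) if key(r) not in seen]
for r in missing:
    print(r)
```

At ~200k records per file this stays fast, since both the set build and each membership check are O(1) on average.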

Upvotes: 0
