Reputation: 1676
I am working with 2 large CSV files. The smaller being a subset of the larger. The first field is a non-unique key, which is a customerID.
I want to find all lines from the larger file which have the same value in field 1 as the smaller file, then find the delta between this result, and the original subset.
I then want to remove all rows from the original subset which have a value in field one that exists in the delta.
In other words: I want to remove any row from the original smaller subset whose customerID also appears in a row that exists in the large original file but not in the smaller subset.
I am currently doing it like below, and I don't understand why the result is 0.
There is obviously a flaw in my logic, and this is clearly not the most elegant way to do this, so please, I'm open to suggestions for a better way.
File: full.csv
,1052,tec101,UNIX
,1052,ser303,UNIX
,1052,backu2,UNIX
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
,83999,doerf1,UNIX
,1551,taxtri,UNIX
,1572,lodes4,UNIX
,1603,wes244,Windows
,102888,law642,UNIX
,1700,au2960,UNIX
File: subset.csv
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
What I'm doing now:
[jgalley@linux1 sandbox]$ wc -l *
19 full.csv
10 subset.csv
29 total
[jgalley@linux1 sandbox]$ cat subset.csv | awk -F, '{print ","$2","}' > subset_keys
[jgalley@linux1 sandbox]$ grep -F -f subset_keys full.csv | wc -l
13
[jgalley@linux1 sandbox]$ grep -F -f subset_keys full.csv | head -n2
,1052,tec101,UNIX
,1052,ser303,UNIX
[jgalley@linux1 sandbox]$ grep -F -f subset_keys full.csv > subset_keys_grep
[jgalley@linux1 sandbox]$ cat subset_keys_grep | awk -F, '{print ","$2","}' | head -n2
,1052,
,1052,
[jgalley@linux1 sandbox]$ cat subset_keys_grep | awk -F, '{print ","$2","}' | wc -l
13
[jgalley@linux1 sandbox]$ cat subset_keys_grep | awk -F, '{print ","$2","}' > keys_to_remove
[jgalley@linux1 sandbox]$ grep -F -f keys_to_remove subset.csv | wc -l
10
[jgalley@linux1 sandbox]$ grep -F -f keys_to_remove subset.csv > lines_to_remove
[jgalley@linux1 sandbox]$ grep -Fv -f lines_to_remove subset.csv | wc -l
0
My expected result should be 7, or, without the count, the following:
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
The result should be the 7 rows of the subset whose customerIDs do not appear in any additional rows of the full file, i.e. every full.csv row with that customerID is already present in the subset.
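Restated as a testable condition: a subset row should survive exactly when its customerID occurs the same number of times in full.csv as in subset.csv (for `,1052,` that is 6 vs 3, so those rows go). A minimal, self-contained sketch of that condition in awk, using copies of the files above (the `_demo` file names are made up for illustration, and this is one possible approach, not necessarily the best):

```shell
# Recreate the question's two input files under hypothetical names.
cat > full_demo.csv <<'EOF'
,1052,tec101,UNIX
,1052,ser303,UNIX
,1052,backu2,UNIX
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
,83999,doerf1,UNIX
,1551,taxtri,UNIX
,1572,lodes4,UNIX
,1603,wes244,Windows
,102888,law642,UNIX
,1700,au2960,UNIX
EOF
cat > subset_demo.csv <<'EOF'
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
EOF

# Keep a subset row only when its key (field 2) occurs the same
# number of times in the full file as in the subset file.
awk -F, '
NR == FNR { full[$2]++; next }                 # pass 1: key counts in full
{ cnt[$2]++; line[FNR] = $0; key[FNR] = $2; n = FNR }
END {
    for (i = 1; i <= n; i++)
        if (full[key[i]] == cnt[key[i]]) print line[i]
}' full_demo.csv subset_demo.csv > kept.csv
cat kept.csv
```

On the data above this prints the 7 expected rows, in subset order.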
Upvotes: 1
Views: 2545
Reputation: 203645
Try this. I modified your input "full.csv" to remove the first 3 lines:
$ cat full1.csv
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
,83999,doerf1,UNIX
,1551,taxtri,UNIX
,1572,lodes4,UNIX
,1603,wes244,Windows
,102888,law642,UNIX
,1700,au2960,UNIX
so your requirements stand out a bit better for the case where 1 key appears multiple times in your subset.csv file. It assumes the order of lines in the subset.csv file matches the order in the full.csv file. If that's not the case, it just takes a tweak to split the strings...
$ cat test.awk
BEGIN{ FS="," }
NR==FNR { key2full[$2] = key2full[$2] $0 ORS; next }
{ key2subset[$2] = key2subset[$2] $0 ORS }
END {
    for (key in key2subset) {
        if (key2subset[key] == key2full[key]) {
            printf "%s", key2subset[key]
        }
    }
}
$ awk -f test.awk full1.csv subset.csv
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1419,graphe,UNIX
,44536,hai499,UNIX
,48567,wdanro,UNIX
,1274,lemo27,Windows
,81860,pro846,UNIX
,39835,deepr2,UNIX
,1059,marygs,UNIX
I seem to have caused some confusion by running my command on a modified input file above, so here it is running against the original file to show that it does produce the desired output:
$ cat full.csv
,1052,tec101,UNIX
,1052,ser303,UNIX
,1052,backu2,UNIX
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
,83999,doerf1,UNIX
,1551,taxtri,UNIX
,1572,lodes4,UNIX
,1603,wes244,Windows
,102888,law642,UNIX
,1700,au2960,UNIX
$ awk -f test.awk full.csv subset.csv
,1419,graphe,UNIX
,44536,hai499,UNIX
,48567,wdanro,UNIX
,1274,lemo27,Windows
,81860,pro846,UNIX
,39835,deepr2,UNIX
,1059,marygs,UNIX
Upvotes: 1
Reputation: 23364
Based on your statement that all records in subset.csv exist in full.csv, the following should work in bash. It first uses uniq -u to isolate the lines that appear only in full.csv (the delta); those are then joined back against subset.csv on the CustomerID field to filter out the rows with matching keys.
join -o 2.1,2.2,2.3,2.4 -v 2 -t, -1 2 -2 2 <(sort full.csv subset.csv | uniq -u |
sort -k2,2 -t,) <(sort -k2,2 -t, subset.csv)
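Broken into steps on a trimmed example (the `_demo` file names are illustrative, and GNU join/sort/uniq are assumed), the same pipeline reads as:

```shell
# Trimmed demo inputs with the same shape as the question's files.
cat > full_demo.csv <<'EOF'
,1052,tec101,UNIX
,1052,sma114,UNIX
,1059,marygs,UNIX
EOF
cat > subset_demo.csv <<'EOF'
,1052,sma114,UNIX
,1059,marygs,UNIX
EOF

# 1. Lines appearing only once across both files, i.e. in full only (the delta).
sort full_demo.csv subset_demo.csv | uniq -u > delta.csv

# 2. Sort both sides on the key (field 2, comma-separated) as join requires.
sort -t, -k2,2 delta.csv > delta.sorted
sort -t, -k2,2 subset_demo.csv > subset.sorted

# 3. -v 2 prints subset rows whose key has no partner in the delta.
join -t, -1 2 -2 2 -v 2 -o 2.1,2.2,2.3,2.4 delta.sorted subset.sorted > joined.csv
cat joined.csv
```

Here the delta is `,1052,tec101,UNIX`, so the `,1052,` subset row is dropped and only the `,1059,` row survives.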
Upvotes: 0
Reputation: 74028
This should do the trick:
grep -v -f subset.csv full.csv | awk -F, '{print ","$2",";}' >keys.csv
grep -v -f keys.csv subset.csv
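One caveat: without -F, grep treats every line of the pattern files as a regular expression, and any substring match counts. A slightly hardened variant of the same idea using fixed-string matching (the `_demo` file names are made up; this is a sketch, not the original answer's exact commands):

```shell
# Trimmed demo inputs with the same shape as the question's files.
cat > full_demo.csv <<'EOF'
,1052,tec101,UNIX
,1052,sma114,UNIX
,1059,marygs,UNIX
EOF
cat > subset_demo.csv <<'EOF'
,1052,sma114,UNIX
,1059,marygs,UNIX
EOF

# Lines present in full but not in subset, reduced to their ,key, fields.
grep -F -v -f subset_demo.csv full_demo.csv | awk -F, '{print ","$2","}' > keys.csv

# Drop subset rows whose key appears in that delta.
grep -F -v -f keys.csv subset_demo.csv > kept.csv
cat kept.csv
```

Here keys.csv contains `,1052,`, so only `,1059,marygs,UNIX` survives. The surrounding commas in the key pattern keep, say, `,105,` from matching `,1052,`.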
Upvotes: 0