jesse_galley

Reputation: 1676

Using grep to remove lines from a file, where the keys exist in another file

I am working with two large CSV files, the smaller being a subset of the larger. The first field is a non-unique key, a customerID.

I want to find all lines from the larger file which have the same value in field 1 as the smaller file, then find the delta between this result, and the original subset.

I then want to remove all rows from the original subset which have a value in field one that exists in the delta.

In other words: I want to remove any row from the original smaller subset whose customerID also appears in a row that exists in the large original file but not in the smaller subset.

I am currently doing it as shown below, and I don't understand why the result is 0.

There is obviously a flaw in my logic, and this is clearly not the most elegant way to do this, so I'm open to suggestions for a better approach.

File: full.csv

,1052,tec101,UNIX
,1052,ser303,UNIX
,1052,backu2,UNIX
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
,83999,doerf1,UNIX
,1551,taxtri,UNIX
,1572,lodes4,UNIX
,1603,wes244,Windows
,102888,law642,UNIX
,1700,au2960,UNIX

File: subset.csv

,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX

What I'm doing now:

[jgalley@linux1 sandbox]$ wc -l *
 19 full.csv
 10 subset.csv
 29 total
[jgalley@linux1 sandbox]$ cat subset.csv |  awk -F, '{print ","$2","}' > subset_keys
[jgalley@linux1 sandbox]$ grep -F -f subset_keys full.csv | wc -l
13
[jgalley@linux1 sandbox]$ grep -F -f subset_keys full.csv | head -n2
,1052,tec101,UNIX
,1052,ser303,UNIX
[jgalley@linux1 sandbox]$ grep -F -f subset_keys full.csv > subset_keys_grep
[jgalley@linux1 sandbox]$ cat subset_keys_grep | awk -F, '{print ","$2","}' | head -n2
,1052,
,1052,
[jgalley@linux1 sandbox]$ cat subset_keys_grep | awk -F, '{print ","$2","}' | wc -l
13
[jgalley@linux1 sandbox]$ cat subset_keys_grep | awk -F, '{print ","$2","}' > keys_to_remove
[jgalley@linux1 sandbox]$ grep -F -f keys_to_remove subset.csv | wc -l
10
[jgalley@linux1 sandbox]$ grep -F -f keys_to_remove subset.csv > lines_to_remove
[jgalley@linux1 sandbox]$ grep -Fv -f lines_to_remove  subset.csv | wc -l
0

My expected result should be a count of 7, or, without the count, the following:

,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX

The result should be the 7 rows of the subset whose customerIDs do not appear on any additional row of the full file.
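Tracing the intermediate files end-to-end shows where the logic breaks. In this self-contained reproduction (file contents copied verbatim from above; the scratch directory is just for illustration), every customerID in subset.csv also occurs in full.csv, so keys_to_remove ends up covering every subset key, lines_to_remove becomes all of subset.csv, and the final inverted grep therefore removes everything:

```shell
#!/bin/sh
# Reproduce the session above in a scratch directory.
set -e
cd "$(mktemp -d)"

cat > full.csv <<'EOF'
,1052,tec101,UNIX
,1052,ser303,UNIX
,1052,backu2,UNIX
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
,83999,doerf1,UNIX
,1551,taxtri,UNIX
,1572,lodes4,UNIX
,1603,wes244,Windows
,102888,law642,UNIX
,1700,au2960,UNIX
EOF

cat > subset.csv <<'EOF'
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
EOF

awk -F, '{print ","$2","}' subset.csv > subset_keys
# 13 lines of full.csv carry a key that is also in subset.csv:
matches=$(grep -F -c -f subset_keys full.csv)

# The keys of those 13 lines are exactly the subset keys again,
# so the next grep selects every line of subset.csv ...
grep -F -f subset_keys full.csv | awk -F, '{print ","$2","}' > keys_to_remove
grep -F -f keys_to_remove subset.csv > lines_to_remove

# ... and inverting the match against all 10 lines leaves nothing:
remaining=$(grep -F -v -c -f lines_to_remove subset.csv || true)
echo "matches=$matches remaining=$remaining"
```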

Upvotes: 1

Views: 2545

Answers (3)

Ed Morton

Reputation: 203645

Try this. I modified your input "full.csv" to remove the first 3 lines:

$ cat full1.csv
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
,83999,doerf1,UNIX
,1551,taxtri,UNIX
,1572,lodes4,UNIX
,1603,wes244,Windows
,102888,law642,UNIX
,1700,au2960,UNIX

so your requirements stand out a bit better for the case where one key appears multiple times in your subset.csv file. It assumes the order of lines in the subset.csv file matches the order in the full.csv file. If that's not the case it just takes a tweak to split the strings...

$ cat test.awk                       
BEGIN{ FS="," }
NR==FNR { key2full[$2] = key2full[$2] $0 ORS; next }
{ key2subset[$2] = key2subset[$2] $0 ORS }
END {
   for (key in key2subset) {
      if (key2subset[key] == key2full[key]) {
         printf "%s", key2subset[key]
      }
   }
}
$ awk -f test.awk full1.csv subset.csv
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1419,graphe,UNIX
,44536,hai499,UNIX
,48567,wdanro,UNIX
,1274,lemo27,Windows
,81860,pro846,UNIX
,39835,deepr2,UNIX
,1059,marygs,UNIX

I seem to have caused some confusion by running my command on a modified input file above, so here it is running against the original file to show that it does produce the desired output:

$ cat full.csv 
,1052,tec101,UNIX
,1052,ser303,UNIX
,1052,backu2,UNIX
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
,83999,doerf1,UNIX
,1551,taxtri,UNIX
,1572,lodes4,UNIX
,1603,wes244,Windows
,102888,law642,UNIX
,1700,au2960,UNIX

$ awk -f test.awk full.csv subset.csv
,1419,graphe,UNIX
,44536,hai499,UNIX
,48567,wdanro,UNIX
,1274,lemo27,Windows
,81860,pro846,UNIX
,39835,deepr2,UNIX
,1059,marygs,UNIX

Upvotes: 1

iruvar

Reputation: 23364

Based on your statement that all records in subset.csv exist in full.csv, the following should work in bash. It first uses uniq to identify the qualifying rows from full.csv, then joins them back against subset.csv to filter out subset rows with matching customerIDs.

join -o 2.1,2.2,2.3,2.4 -v 2 -t, -1 2 -2 2 <(sort full.csv subset.csv | uniq -u | 
sort -k2,2 -t,) <(sort -k2,2 -t, subset.csv)
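The intermediate step may be easier to follow run end-to-end. In this self-contained sketch (file contents copied from the question, run in a throwaway directory), `sort | uniq -u` keeps the lines that occur exactly once across both files, which, because subset.csv is a strict subset, are exactly the 9 rows present only in full.csv; `join -v 2` then keeps the subset rows whose key pairs with none of them:

```shell
#!/bin/bash
set -e
cd "$(mktemp -d)"

cat > full.csv <<'EOF'
,1052,tec101,UNIX
,1052,ser303,UNIX
,1052,backu2,UNIX
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
,83999,doerf1,UNIX
,1551,taxtri,UNIX
,1572,lodes4,UNIX
,1603,wes244,Windows
,102888,law642,UNIX
,1700,au2960,UNIX
EOF

cat > subset.csv <<'EOF'
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
EOF

# Lines occurring exactly once across both files = rows of full.csv
# with no identical row in subset.csv (valid since subset is
# contained in full):
delta=$(sort full.csv subset.csv | uniq -u | wc -l | tr -d ' ')

# join -v 2 prints the subset rows whose key matches no delta row;
# both inputs must be sorted on the join field (field 2) the same way:
result=$(join -o 2.1,2.2,2.3,2.4 -v 2 -t, -1 2 -2 2 \
    <(sort full.csv subset.csv | uniq -u | sort -t, -k2,2) \
    <(sort -t, -k2,2 subset.csv) | wc -l | tr -d ' ')
echo "delta=$delta result=$result"
```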

Upvotes: 0

Olaf Dietsche

Reputation: 74028

This should do the trick:

grep -v -f subset.csv full.csv | awk -F, '{print ","$2",";}' >keys.csv
grep -v -f keys.csv subset.csv
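Since both files hold only literal text, a slightly safer variant (my tweak, not part of the answer above) adds -F so the patterns are taken as fixed strings rather than regular expressions, and -x so the first grep only discards whole-line matches. Wrapped up with the question's data, it yields the 7 expected rows:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

cat > full.csv <<'EOF'
,1052,tec101,UNIX
,1052,ser303,UNIX
,1052,backu2,UNIX
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
,83999,doerf1,UNIX
,1551,taxtri,UNIX
,1572,lodes4,UNIX
,1603,wes244,Windows
,102888,law642,UNIX
,1700,au2960,UNIX
EOF

cat > subset.csv <<'EOF'
,1052,sma114,UNIX
,1052,appsup,UNIX
,1052,emails,UNIX
,1059,marygs,UNIX
,39835,deepr2,UNIX
,44536,hai499,UNIX
,1274,lemo27,Windows
,48567,wdanro,UNIX
,81860,pro846,UNIX
,1419,graphe,UNIX
EOF

# Fixed-string, whole-line inversion: full.csv rows not in subset.csv
grep -F -x -v -f subset.csv full.csv | awk -F, '{print ","$2","}' > keys.csv
# Drop subset rows whose key appears among those leftover rows:
grep -F -v -f keys.csv subset.csv > result.csv
count=$(wc -l < result.csv | tr -d ' ')
echo "count=$count"
```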

Upvotes: 0
