Reputation: 509
I have two files:
candidates.csv
:
id,value
1,123
4,1
2,5
50,5
blacklist.csv
:
1
2
5
3
10
I'd like to remove all rows from candidates.csv
in which the first column (id
) has a value contained in blacklist.csv
. id
is always numeric. In this case I'd like my output to look like this:
id,value
4,1
50,5
So far, my script for identifying the duplicate lines looks like this:
cat candidates.csv | cut -d \, -f 1 | grep -f blacklist.csv -w
This gives me the output
1
2
Now I somehow need to pipe this information back into sed
/awk
/gawk
/... to delete the duplicates, but I don't know how. Any ideas how I can continue from here? Or is there a better solution altogether? My only restriction is that it has to run in bash.
Upvotes: 5
Views: 7565
Reputation: 11489
You can use sed
and grep
together to get the output
$ sed -e 's/[0-9]+/&\,/g' blacklist.csv > filter.csv
$ grep -Fvf filter.csv candidates.csv
id,value
4,1
50,5
sed
command adds a ,
to each id
and output to a filter.csv
. E
is to interprete regex in MacOSX/FreeBSD, same as -r
in GNU sed
.
grep
uses the option f
to compare between files, then remove the lines using v
. F
is for fixed string.
Upvotes: 0
Reputation: 17376
If you are not too concerned about the order of the lines in your candidates.csv
file you could use the following:
join -v 1 -t, <(sort -t, candidates.csv) <(sort blacklist.csv)
The -v 1
requests all the lines from the first file (the sorted candidates.csv
) which do not match on the first field with the second file (the blacklist.csv
). The -t,
just sets the comma as a separator.
If you're concerned about the header line in the candidates.csv
file you could strip it before the sorting or change the order.
Upvotes: 1
Reputation: 26551
What about the following:
awk -F, '(NR==FNR){a[$1];next}!($1 in a)' blacklist.csv candidates.csv
How does this work?
An awk program is a series of pattern-action pairs, written as:
condition { action }
condition { action }
...
where condition
is typically an expression and action
a series of commands. Here, the first condition-action pairs read:
(NR==FNR){a[$1];next}
if the total record count NR
equals the record count of the file FNR
(i.e. if we are reading the first file), store all values in array a
and skip to the next record (do not do anything else)!($1 in a)
if the first field is not in the array a
then perform the default action which is print the line. This will only work on the second file as the condition of the first condition-action pair does not hold.Upvotes: 13