Reputation: 1793
Assume a tab-separated file as input file 1 ($IN1
):
$ cat input1.tsv
HEAD1 HEAD2 HEAD3 HEAD4 HEAD5 HEAD6
Qux ZX_999876 Bar Foo MN111111 Quux
Foo AB_123456 Bar Baz CD789123 Qux
Bar AC_456321 Baz Qux GF333444 Foo
Foo CD789123 Qux Baz GH987124 Qux
Further assume a csv-separated file as input file 2 ($IN2
):
$ cat input2.csv
AB_123456,CD789123
ZX_999876,MN111111
Input file 2 specifies which lines of input file 1 need to be deleted. Specifically, if column 1 entry of input file 2 is equal to the column 2 entry of input file 1, then all lines of input file 1 in which column 2 entry of input file 2 is equal to the column 2 entry of input file 1 are to be removed.
Desired output:
$ cat input1_filtered.tsv
HEAD1 HEAD2 HEAD3 HEAD4 HEAD5 HEAD6
Qux ZX_999876 Bar Foo MN111111 Quux
Foo AB_123456 Bar Baz CD789123 Qux
Bar AC_456321 Baz Qux GF333444 Foo
I wish to achieve the desired output via Bash. My (so far insufficient) attempt:
while IFS=, read -r CL1 CL2
do
if awk -F"\t" '$2=="$CL1"' $IN1
then
grep -vPw "(?<=\t)$CL2" $IN1 > tmp.txt
mv tmp.txt $IN1
fi
done < $IN2
Upvotes: 5
Views: 93
Reputation: 35256
Assumptions:
input.tsv
) the HEAD2
values are unique within the file so ...One awk
idea requiring a single pass through each input file:
awk '
FNR==1 { FS = (FNR==NR) ? "," : "\t" # 1st file uses "," as delimiter; 2nd file uses "\t" as delimiter
$0=$0 # force awk to reparse based on new FS
}
FNR==NR { keypair[$1] = $2 # 1st file: save key pairs
next # skip to next line of input
}
{ line[FNR] = $0 # 2nd file: save line based on line number (FNR)
key_to_line[$2] = FNR # save relationship between key and line number
}
END { for (key in keypair) # for each key in our pairs of keys ...
if (key in key_to_line) # if there is an input line with this key then ...
delete line[key_to_line[keypair[key]]] # delete the line (if it exists) which contains the 2nd half of our key pair
for (i=1; i<=NF; i++) # print the lines that remain
if (i in line)
print line[i]
}
' input2.csv input1.tsv
NOTE: this assumes both files can fit into memory
This generates:
HEAD1 HEAD2 HEAD3 HEAD4 HEAD5 HEAD6
Qux ZX_999876 Bar Foo MN111111 Quux
Foo AB_123456 Bar Baz CD789123 Qux
Bar AC_456321 Baz Qux GF333444 Foo
Upvotes: 4
Reputation: 785856
Here is an awk
solution that does the job in 2 pass with any awk version:
cat filt.awk
FNR == NR {
vals[$1] = $2
next
}
$2 in vals {
notNeeded[ vals[$2] ]
}
{
row[FNR] = $0
}
END {
for (i=1; i<=FNR; ++i) {
split(row[i], tok, FS)
if ( !(tok[2] in notNeeded) )
print row[i]
}
}
Then use it as:
awk -F, -f filt.awk input2.csv FS='\t' input1.tsv
HEAD1 HEAD2 HEAD3 HEAD4 HEAD5 HEAD6
Qux ZX_999876 Bar Foo MN111111 Quux
Foo AB_123456 Bar Baz CD789123 Qux
Bar AC_456321 Baz Qux GF333444 Foo
Upvotes: 5
Reputation: 247092
This uses GNU awk specifically:
gawk '
BEGINFILE { FS = ARGIND == 1 ? "," : "\t" }
ARGIND == 1 { is_present[$1] = $2; next }
ARGIND == 2 { if ($2 in is_present) del[is_present[$2]] = 1; next }
!($2 in del)
' input2.csv input1.tsv input1.tsv
outputs
HEAD1 HEAD2 HEAD3 HEAD4 HEAD5 HEAD6
Qux ZX_999876 Bar Foo MN111111 Quux
Foo AB_123456 Bar Baz CD789123 Qux
Bar AC_456321 Baz Qux GF333444 Foo
Upvotes: 6