Michael Gruenstaeudl
Michael Gruenstaeudl

Reputation: 1793

Using awk and inverse grep for specific columns across files

Assume a tab-separated file as input file 1 ($IN1):

$ cat input1.tsv
HEAD1   HEAD2   HEAD3   HEAD4   HEAD5   HEAD6
Qux ZX_999876   Bar Foo MN111111    Quux
Foo AB_123456   Bar Baz CD789123    Qux
Bar AC_456321   Baz Qux GF333444    Foo
Foo CD789123    Qux Baz GH987124    Qux

Further assume a csv-separated file as input file 2 ($IN2):

$ cat input2.csv
AB_123456,CD789123
ZX_999876,MN111111

Input file 2 specifies which lines of input file 1 need to be deleted. Specifically, if column 1 entry of input file 2 is equal to the column 2 entry of input file 1, then all lines of input file 1 in which column 2 entry of input file 2 is equal to the column 2 entry of input file 1 are to be removed.

Desired output:

$ cat input1_filtered.tsv
HEAD1   HEAD2   HEAD3   HEAD4   HEAD5   HEAD6
Qux ZX_999876   Bar Foo MN111111    Quux
Foo AB_123456   Bar Baz CD789123    Qux
Bar AC_456321   Baz Qux GF333444    Foo

I wish to achieve the desired output via Bash. My (so far insufficient) attempt:

while IFS=, read -r CL1 CL2
do
    if awk -F"\t" '$2=="$CL1"' $IN1
    then
        grep -vPw "(?<=\t)$CL2" $IN1 > tmp.txt
        mv tmp.txt $IN1
    fi
done < $IN2

Upvotes: 5

Views: 93

Answers (3)

markp-fuso
markp-fuso

Reputation: 35256

Assumptions:

  • in the first file (input.tsv) the HEAD2 values are unique within the file so ...
  • when it comes to removing a line we will find either 0 or 1 match (ie, there will never be 2+ matches)

One awk idea requiring a single pass through each input file:

awk '
FNR==1  { FS = (FNR==NR) ? "," : "\t"         # 1st file uses "," as delimiter; 2nd file uses "\t" as delimiter
          $0=$0                               # force awk to reparse based on new FS
        }
FNR==NR { keypair[$1] = $2                                  # 1st file: save key pairs
          next                                              # skip to next line of input
        }
        { line[FNR] = $0                                    # 2nd file: save line based on line number (FNR)
          key_to_line[$2] = FNR                             # save relationship between key and line number
        }
END     { for (key in keypair)                              # for each key in our pairs of keys ...
              if (key in key_to_line)                       # if there is an input line with this key then ...
                 delete line[key_to_line[keypair[key]]]     # delete the line (if it exists) which contains the 2nd half of our key pair

          for (i=1; i<=NF; i++)                             # print the lines that remain
              if (i in line)
                 print line[i]
        }
' input2.csv input1.tsv

NOTE: this assumes both files can fit into memory

This generates:

HEAD1   HEAD2   HEAD3   HEAD4   HEAD5   HEAD6
Qux ZX_999876   Bar Foo MN111111    Quux
Foo AB_123456   Bar Baz CD789123    Qux
Bar AC_456321   Baz Qux GF333444    Foo

Upvotes: 4

anubhava
anubhava

Reputation: 785856

Here is an awk solution that does the job in 2 pass with any awk version:

cat filt.awk

FNR == NR {
   vals[$1] = $2
   next
}
$2 in vals {
   notNeeded[ vals[$2] ]
}
{
   row[FNR] = $0
}
END {
   for (i=1; i<=FNR; ++i) {
      split(row[i], tok, FS)
      if ( !(tok[2] in notNeeded) )
         print row[i]
   }
}

Then use it as:

awk -F, -f filt.awk input2.csv FS='\t' input1.tsv

HEAD1   HEAD2   HEAD3   HEAD4   HEAD5   HEAD6
Qux ZX_999876   Bar Foo MN111111    Quux
Foo AB_123456   Bar Baz CD789123    Qux
Bar AC_456321   Baz Qux GF333444    Foo

Upvotes: 5

glenn jackman
glenn jackman

Reputation: 247092

This uses GNU awk specifically:

  • process input2.csv, storing the mapping
  • process input1.tsv, determining if col1 of input2.csv is present
  • process input1.tsv a second time, filtering out the unwanted records
gawk '
    BEGINFILE   { FS = ARGIND == 1 ? "," : "\t" }
    ARGIND == 1 { is_present[$1] = $2; next }
    ARGIND == 2 { if ($2 in is_present) del[is_present[$2]] = 1; next }
    !($2 in del)
' input2.csv input1.tsv input1.tsv

outputs

HEAD1   HEAD2   HEAD3   HEAD4   HEAD5   HEAD6
Qux ZX_999876   Bar Foo MN111111    Quux
Foo AB_123456   Bar Baz CD789123    Qux
Bar AC_456321   Baz Qux GF333444    Foo

ref: ARGIND, BEGINFILE

Upvotes: 6

Related Questions