K. Wamae

Reputation: 225

awk - no output after subtracting two matching columns in two files

I'm learning awk and I'd like to use it to get the difference between two columns in two files.

If an entry in file_2 column-2 exists in file_1 column-4, I want to subtract file_2 column-3 from file_1 column-2.

file_1.txt

chrom_1 1000    2000    gene_1
chrom_2 3000    4000    gene_2
chrom_3 5000    6000    gene_3
chrom_4 7000    8000    gene_4

file_2.txt

chrom_1 gene_1  114 252
chrom_9 gene_5  24  183
chrom_2 gene_2  117 269

Here's my code but I get no output:

awk -F'\t' 'NR==FNR{key[$1]=$4;file1col1[$1]=$2;next} $2 in key {print file1col1[$1]-$3}' file_1.txt file_2.txt

Upvotes: 1

Views: 45

Answers (1)

David C. Rankin

Reputation: 84652

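You are close. Your arrays are indexed by the chromosome ($1), so the test $2 in key, which looks up a gene name, never matches and nothing is printed. If you instead index key by the gene name (field 4 of file_1) and store the value from field 2, you can simply compute key[$2] - $3 to get your result, e.g.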

awk 'NR==FNR {key[$4] = $2; next} $2 in key {print key[$2] - $3}' file_1.txt file_2.txt
886
2883

(note: file_1 has no gene_5, so without a guard key["gene_5"] would be taken as 0 and the rule would print -24. The test $2 in key restricts the 2nd rule to records whose gene is present in key.)
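To make the difference concrete, here is a minimal sketch that runs the question's command and the corrected one side by side. It assumes the sample data is tab-separated (the question uses -F'\t'), and the temporary directory and the variable names original and fixed are made up for illustration.

```shell
# Recreate the sample data from the question (tab-separated is an assumption).
dir=$(mktemp -d)
printf 'chrom_1\t1000\t2000\tgene_1\nchrom_2\t3000\t4000\tgene_2\nchrom_3\t5000\t6000\tgene_3\nchrom_4\t7000\t8000\tgene_4\n' > "$dir/file_1.txt"
printf 'chrom_1\tgene_1\t114\t252\nchrom_9\tgene_5\t24\t183\nchrom_2\tgene_2\t117\t269\n' > "$dir/file_2.txt"

# Question's approach: arrays indexed by chromosome ($1), but tested with the gene name ($2).
original=$(awk -F'\t' 'NR==FNR {key[$1]=$4; file1col1[$1]=$2; next} $2 in key {print file1col1[$1]-$3}' "$dir/file_1.txt" "$dir/file_2.txt")

# Corrected approach: array indexed by gene name ($4), so "$2 in key" can match.
fixed=$(awk -F'\t' 'NR==FNR {key[$4]=$2; next} $2 in key {print key[$2]-$3}' "$dir/file_1.txt" "$dir/file_2.txt")

printf 'original: [%s]\n' "$original"   # empty brackets: no record passed the test
printf 'fixed:\n%s\n' "$fixed"          # 886 and 2883, one per line
rm -rf "$dir"
```

If the real files are separated by spaces rather than tabs, -F'\t' would leave each line as a single field, so dropping the -F option (as in the command above) is the safer default.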

Write the Rules Out

Sometimes it helps to write the rules for the script out rather than trying to make a 1-liner out of the script. This allows for better readability. For example:

awk '
  NR==FNR {                 # Rule1 conditioned by NR==FNR (file_1)
    key[$4] = $2            # Store value from field 2 indexed by field 4
    next                    # Skip to next record
  }
  $2 in key {               # Rule2 conditioned by $2 in key (file_2)
    print key[$2] - $3      # Output value from file_1 - field 3
  }
' file_1.txt file_2.txt

Further Explanation

awk reads each line of input (record) from the file(s) and applies each rule to the record in the order the rules appear. Here, when the total record number (NR) equals the per-file record number (FNR), which is only true while reading file_1, the first rule is applied and then the next statement tells awk to skip the remaining rules and go read the next record.
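A quick way to see how the two counters behave is to print them for every record. This is only an illustrative sketch: the two small input files and the variable name counters are hypothetical, not part of the question.

```shell
# Two tiny throwaway files, made up for this illustration.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/first.txt"
printf 'c\nd\ne\n' > "$dir/second.txt"

# Print each record with NR (running total), FNR (per-file count)
# and which rule of the two-file idiom would handle it.
counters=$(awk '{ print $0, NR, FNR, (NR==FNR ? "rule 1" : "rule 2") }' "$dir/first.txt" "$dir/second.txt")
printf '%s\n' "$counters"
rm -rf "$dir"
```

FNR restarts at 1 for the second file while NR keeps counting, so NR==FNR stops being true once the first file has been read. One caveat: if the first file is empty, NR==FNR is also true for the second file, so the idiom relies on the first file containing at least one record.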

Rule 2 is conditioned by $2 in key, which tests whether the gene name from file_2 exists as an index in key. (The index in array test does not create a new element in the array, whereas referencing key[$2] directly would. This is a useful benefit of the test.) If the gene name exists in the key array filled from file_1, then field 3 from file_2 is subtracted from the stored value and the difference is output.
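The side effect can be checked directly. The sketch below is illustrative only: count is a hypothetical helper that counts elements with a for-in loop (used instead of length(array) so that it should also run on older awk implementations), and sizes is just a shell variable holding the output.

```shell
sizes=$(awk '
  # Hypothetical helper: count the elements of an array.
  function count(arr,    k, n) { n = 0; for (k in arr) n++; return n }
  BEGIN {
    key["gene_1"] = 1000

    # Membership test: does not add "gene_5" to the array.
    if ("gene_5" in key) print "unexpected match"
    print "after in test:", count(key)

    # Plain reference: creates an empty element, which is 0 in arithmetic.
    value = key["gene_5"] + 0
    print "after reference:", count(key), "value:", value
  }
')
printf '%s\n' "$sizes"
```

The array still holds one element after the in test, but two after the plain reference, which is why guarding the rule with $2 in key keeps the lookup table clean.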

One of the best references to use when learning awk is The GNU Awk User's Guide. It provides an excellent reference for awk, and any gawk-only features are clearly marked with '#'.

Upvotes: 4
