Reputation: 225
I'm learning awk and I'd like to use it to get the difference between two columns in two files.
If an entry in file_2 column-2 exists in file_1 column-4, I want to subtract file_2 column-3 from file_1 column-2.
file_1.txt
chrom_1 1000 2000 gene_1
chrom_2 3000 4000 gene_2
chrom_3 5000 6000 gene_3
chrom_4 7000 8000 gene_4
file_2.txt
chrom_1 gene_1 114 252
chrom_9 gene_5 24 183
chrom_2 gene_2 117 269
Here's my code but I get no output:
awk -F'\t' 'NR==FNR{key[$1]=$4;file1col1[$1]=$2;next} $2 in key {print file1col1[$1]-$3}' file_1.txt file_2.txt
Upvotes: 1
Views: 45
Reputation: 84652
You are close. Indexing key by the gene name (field 4 of file_1) and storing the value from the 2nd field will allow you to simply subtract key[$2] - $3 to get your result, e.g.
awk 'NR==FNR {key[$4] = $2; next} $2 in key {print key[$2] - $3}' file_1.txt file_2.txt
886
2883
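(note: there is no gene_5 in file_1, so an unguarded reference to key["gene_5"] would be taken as 0 in the subtraction. The test $2 in key conditions the 2nd rule to only execute if the gene is present in key, so the gene_5 line produces no output.)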
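To see why the original attempt prints nothing, here is a minimal sketch that runs both indexing schemes side by side. It assumes the sample files are whitespace-separated as displayed in the question, so -F'\t' is dropped; the scratch directory and the variable names original and fixed are only for illustration. The original stores key[$1] (chrom_N) but then tests $2 (gene_N) against it, so the lookup never succeeds regardless of the separator.

```shell
# Scratch copies of the sample files from the question (space-separated, as displayed)
tmpdir=$(mktemp -d)
printf '%s\n' 'chrom_1 1000 2000 gene_1' 'chrom_2 3000 4000 gene_2' \
    'chrom_3 5000 6000 gene_3' 'chrom_4 7000 8000 gene_4' > "$tmpdir/file_1.txt"
printf '%s\n' 'chrom_1 gene_1 114 252' 'chrom_9 gene_5 24 183' \
    'chrom_2 gene_2 117 269' > "$tmpdir/file_2.txt"

# Original indexing: key is indexed by chrom_N, but the test looks up gene_N
original=$(awk 'NR==FNR {key[$1] = $4; next} $2 in key {print "match"}' \
    "$tmpdir/file_1.txt" "$tmpdir/file_2.txt")

# Corrected indexing: key is indexed by gene_N, matching field 2 of file_2
fixed=$(awk 'NR==FNR {key[$4] = $2; next} $2 in key {print $2, key[$2] - $3}' \
    "$tmpdir/file_1.txt" "$tmpdir/file_2.txt")

echo "original: [$original]"   # nothing between the brackets
echo "$fixed"                  # gene name followed by the difference
rm -rf "$tmpdir"
```

Printing $2 alongside the difference is an optional extra here; it just makes it easier to see which gene each number belongs to.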
Write the Rules Out
Sometimes it helps to write the rules for the script out rather than trying to make a 1-liner out of the script. This allows for better readability. For example:
awk '
NR==FNR {                   # Rule1 conditioned by NR==FNR (file_1)
    key[$4] = $2            # Store value from field 2 indexed by field 4
    next                    # Skip to next record
}
$2 in key {                 # Rule2 conditioned by $2 in key (file_2)
    print key[$2] - $3      # Output value from file_1 - field 3
}
' file_1.txt file_2.txt
Further Explanation
awk will read each line of input (record) from the file(s) and apply each rule to the record in the order the rules appear. Here, NR==FNR is true when the total record number (NR) equals the per-file record number (FNR), which only holds while reading the first file (file_1). For those records the first rule is applied, and then the next command tells awk to skip everything else and go read the next record.
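A quick way to convince yourself of how NR and FNR behave is to print them for every record. This is only a sketch using scratch copies of the sample data; the variable name counters and the temporary directory are illustrative.

```shell
# Scratch copies of the sample files from the question
tmpdir=$(mktemp -d)
printf '%s\n' 'chrom_1 1000 2000 gene_1' 'chrom_2 3000 4000 gene_2' \
    'chrom_3 5000 6000 gene_3' 'chrom_4 7000 8000 gene_4' > "$tmpdir/file_1.txt"
printf '%s\n' 'chrom_1 gene_1 114 252' 'chrom_9 gene_5 24 183' \
    'chrom_2 gene_2 117 269' > "$tmpdir/file_2.txt"

# Print file name, NR (running total) and FNR (restarts at 1 for each file).
# The subshell cd keeps FILENAME short without changing the caller's directory.
counters=$(cd "$tmpdir" && awk '{ print FILENAME, NR, FNR }' file_1.txt file_2.txt)
echo "$counters"
rm -rf "$tmpdir"
```

The two counters agree on the four file_1.txt lines and diverge from the first line of file_2.txt onward (NR continues at 5 while FNR restarts at 1). One caveat worth keeping in mind: if the first file is empty, NR==FNR is true for the second file instead, so the idiom silently misbehaves.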
Rule 2 is conditioned by $2 in key, which tests whether the gene name from file_2 exists as an index in key. (The index in array test does not create a new element in the array; this is a useful benefit of this test.) If the gene name exists in the key array filled from file_1, then field 3 from file_2 is subtracted from that value and the difference is output.
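The difference between testing for an index and simply referencing it can be shown with a small sketch. The helper function count and the variable names are made up for this illustration; the element count is done with a for-in loop rather than length(array) so it does not depend on a particular awk implementation.

```shell
result=$(awk '
# Hypothetical helper: count the elements of an array with a for-in loop
function count(arr,    k, n) { n = 0; for (k in arr) n++; return n }
BEGIN {
    key["gene_1"] = 1000
    if ("gene_5" in key) print "found"   # membership test only, creates nothing
    print count(key)                     # still 1 element
    diff = key["gene_5"] - 24            # plain reference creates key["gene_5"]
    print diff                           # missing value acts as 0, so -24
    print count(key)                     # now 2 elements
}')
echo "$result"
```

This prints 1, -24 and 2 on separate lines: the in test leaves the array untouched, while the unguarded reference both treats the missing value as 0 and adds an empty element.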
One of the best references to use when learning awk is The GNU Awk User's Guide. It provides an excellent reference for awk, and any gawk-only features are clearly marked with '#'.
Upvotes: 4