Reputation: 387
Is it possible to use awk to compare and return results from both files that match?
I am currently using:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{c[$1$2]++;next};c[$1$2]>0' queryfile hitsfile
to match records from the query file and print the matching lines from the hits file; however, it only returns the columns from the hits file.
I've tried:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{c[$1$2]++;next};c[$1$2]>0 {print $1,$2,c[$1]}'
but it doesn't work.
My example data looks like this:
queryfile
chr1 1000 1005 BDSD
chr1 1010 1015 SKK1
chr2 1015 1015 AVPR
hitsfile
chr1 1000 1005 0.5
chr1 1001 1002 0.35
chr1 1010 1015 0.4
chr1 1011 1016 0.56
chr2 1015 1015 0.1
I would like my output file to look like the following
*output results*
chr1 1000 1005 0.5 BDSD
chr1 1010 1015 0.4 SKK1
chr2 1015 1015 0.1 AVPR
So basically, the hits that match the query are returned PLUS an extra column from the query data. Is this possible using awk one-liners?
Also, another question: is it possible, given a query RANGE in the query file, to return all lines in the hitsfile that fall within that range, rather than only exact matches, with awk?
Usually I do these in R, but it's slow when processing large files, and awk is much, much faster!
Thank you!
Upvotes: 1
Views: 271
Reputation: 46826
NOTE: This answer is accurate for a previous version of the question. Please check the question's revision history for details.
If you're designing a process like this in awk, the basic stuff you'll want to think about is that to compare two files, the important bits of one of them will need to be loaded into memory. If you can make sure that the amount of memory you use won't require use of swap, you'll be ahead. :)
So ... assuming queryfile is small and hitsfile is large, you'd want something like this:
$ awk '
# First, store every line of our first file in an array. Simply mentioning
# an array element is sufficient, you don't need to assign anything.
NR == FNR {
a[$0];
next;
}
# Second, walk through any remaining data (second file, third, etc),
# comparing it to elements in the array we stored in the section above.
# If the condition here is true, the default action is to print the line.
$0 in a
' queryfile hitsfile
This can obviously be shortened to a one-liner. You know how to do that already.
The net result of this is that each line from the second file will be printed if it appeared in the first file. By extension, only lines appearing in both files will be printed.
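As a quick, self-contained check of that idiom (the two files here are made up purely for illustration):

```shell
# Two throwaway files: 'alpha' and 'beta' appear in both.
cat > file1 <<'EOF'
alpha
beta
EOF
cat > file2 <<'EOF'
beta
gamma
alpha
EOF
# Prints only the lines of file2 that also occur in file1,
# in file2's order: beta, then alpha.
awk 'NR==FNR{a[$0];next} $0 in a' file1 file2
```

Note that the output follows the order of the *second* file, since that's the one being scanned line by line while the first sits in memory.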
Using the sample data you've provided in your question, with this I get output that looks identical to the queryfile, since each item of the queryfile appears once in the hitsfile.
If this isn't the result you're looking for, please provide more detailed explanation, and perhaps example output you're looking for, in your question.
Alternate solution:
You might not need to use awk at all.
fgrep -xf queryfile hitsfile
The fgrep command is equivalent to grep -F, which compares fixed strings instead of regular expressions. The -x option tells grep to consider only whole lines, effectively anchoring the match at both the beginning and the end, like a regex ^...$. And the -f option says that the list of strings to match should be taken from the specified file, in this case queryfile.
The end result is that you've got C code running your search rather than an awk script. I'll let you do the benchmarks, since you have the large files, but I'd be interested in knowing the performance difference.
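A rough way to run that comparison yourself; the file contents below are tiny placeholders just so the commands run end to end, so only timings on your real large files mean anything:

```shell
# Toy stand-ins for the real queryfile/hitsfile.
cat > queryfile <<'EOF'
chr1 1000 1005
chr2 1015 1015
EOF
cat > hitsfile <<'EOF'
chr1 1000 1005
chr1 1001 1002
chr2 1015 1015
EOF
# The same whole-line search done both ways; compare the wall-clock times.
time awk 'NR==FNR{a[$0];next} $0 in a' queryfile hitsfile > awk.out
time grep -Fxf queryfile hitsfile > grep.out
# Sanity check: both approaches should select exactly the same lines.
cmp awk.out grep.out && echo "both approaches agree"
```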
Upvotes: 1
Reputation: 203219
$ awk 'NR==FNR{a[$1,$2]=$4;next} ($1,$2) in a{print $0, a[$1,$2]}' queryfile hitsfile
chr1 1000 1005 0.5 BDSD
chr1 1010 1015 0.4 SKK1
chr2 1015 1015 0.1 AVPR
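The range sub-question from the post isn't covered by the answer above. One brute-force sketch in the same spirit, under the assumption that "within" means a hit's interval is contained in a query's interval on the same chromosome, and that the files follow the sample's 4-column layout, checks every hit line against every stored query range (fine when queryfile is small):

```shell
# Recreate the sample data from the question.
cat > queryfile <<'EOF'
chr1 1000 1005 BDSD
chr1 1010 1015 SKK1
chr2 1015 1015 AVPR
EOF
cat > hitsfile <<'EOF'
chr1 1000 1005 0.5
chr1 1001 1002 0.35
chr1 1010 1015 0.4
chr1 1011 1016 0.56
chr2 1015 1015 0.1
EOF
awk '
NR == FNR {                       # first file: remember each query range
    chr[NR] = $1; lo[NR] = $2; hi[NR] = $3; lab[NR] = $4; n = NR
    next
}
{                                 # second file: test against every stored range
    for (i = 1; i <= n; i++)
        if ($1 == chr[i] && $2 >= lo[i] && $3 <= hi[i])
            print $0, lab[i]
}' queryfile hitsfile
```

With the sample data this also picks up chr1 1001 1002 0.35 (inside the BDSD range) and drops chr1 1011 1016 0.56 (its end runs past the SKK1 range). For very large query files an interval tree or a tool like bedtools would scale better than this linear scan.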
Upvotes: 1