Reputation: 3022
The awk below runs, but the output file is 0 bytes. It matches input files of 21 to 259 records against a file of 11,137,660 records. There are 4 input files; each one is used to search the large file of roughly 11,000,000 records, and the output should be the average of $7 over all matching records. I cannot figure out why the output file is empty. Thank you :).
input
AGRN
CCDC39
CCDC40
CFTR
search
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 2 2
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 3 2
expected output
chr1:955543 AGRN|gc=75 1.3
awk
awk '
    NR == FNR {input[$0]; next}
    {
        split($5, a, "-")
        if (a[1] in input) {
            key = $4 OFS $5
            n[key]++
            sum[key] += $7
        }
    }
    END {
        for (key in n)
            printf "%s %.1f\n", key, sum[key]/n[key]
    }
' search.txt input.txt > output.txt
Upvotes: 1
Views: 359
Reputation: 33601
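Because the search file comes first in ARGV, the gene list has not been read yet when the search lines go by. In your script, NR == FNR is true for the search file, so its lines are stored as keys of input, and the main block only ever sees the gene names, which have no $5. Nothing matches, so nothing is printed. With this argument order, you can't do the data matchup until END.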
Here's what I think will work. Based upon your test files, it produces a single line of output:
chr1:955543 AGRN-6|gc=75 1.3
Here is the script file, invoked with awk -f script.awk search.txt input.txt (ARGIND is a gawk extension, so this needs gawk):
BEGIN {
    slen = 0;
}

# get input file(s)
# NOTE: IMO, this is a cleaner better test condition
ARGIND > 1 {
    ###printf("input_push: DEBUG %s\n",$0);
    input[$0];
    next;
}

# get single search list
{
    ###printf("search_push: DEBUG %s\n",$0);
    search[slen++] = $0;
    next;
}

END {
    # sum up data
    for (sidx = 0; sidx < slen; ++sidx) {
        sval = search[sidx];
        ###printf("search_end: DEBUG %s\n",sval);

        split(sval,sary)
        split(sary[5],a,"-");
        ###printf("search_end: DEBUG sary[5]='%s' a[1]='%s'\n",sary[5],a[1]);

        if (a[1] in input) {
            key = sary[4] OFS sary[5]
            n[key]++
            sum[key] += sary[7]
        }
    }

    for (key in n)
        printf "%s %.1f\n", key, sum[key]/n[key]
}
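If holding all 11 million search lines in memory until END is a concern, an alternative that may be simpler is to keep the original awk program and just swap the argument order, so the small gene list is read first. The following is a minimal sketch, assuming a single gene list file; the temporary directory and the result variable are only there to make the example self-contained and are not part of the original question.

```shell
#!/bin/sh
# Sketch: the awk program from the question, with the gene list passed FIRST.
# The temp directory and sample files below are illustrative only.
dir=$(mktemp -d)

printf '%s\n' AGRN CCDC39 CCDC40 CFTR > "$dir/input.txt"

printf '%s\n' \
  'chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0' \
  'chr1 955543 955763 chr1:955543 AGRN-6|gc=75 2 2' \
  'chr1 955543 955763 chr1:955543 AGRN-6|gc=75 3 2' > "$dir/search.txt"

result=$(awk '
    NR == FNR { input[$0]; next }      # first file only: remember each gene name
    {
        split($5, a, "-")              # "AGRN-6|gc=75" -> a[1] = "AGRN"
        if (a[1] in input) {
            key = $4 OFS $5
            n[key]++                   # number of matching records per key
            sum[key] += $7             # running total of column 7 per key
        }
    }
    END {
        for (key in n)
            printf "%s %.1f\n", key, sum[key] / n[key]
    }
' "$dir/input.txt" "$dir/search.txt")

printf '%s\n' "$result"
rm -rf "$dir"
```

With this order the search file is streamed one line at a time and only the gene names and the per-key totals stay in memory. The NR == FNR test only covers the first file, so several gene lists would need to be concatenated first (or handled with the ARGIND approach above).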
Upvotes: 2