Reputation: 2866
I am using a column in one file to look up values in another file. The second file is very large and I would like to find all the values in a single pass with awk. I have tried doing this with an associative array, but am stumped how to get the output I want. I want to take F1, use $2 to look up values in F2 (matching its $4), and get the output I show below, which is $0 from F1 as the header, followed by $10 from F2 sorted and counted for each unique string (i.e. piped through sort | uniq -c).
F1
+ID=dnaK.p01 12121 TTGGGCAGTTGAAACCAGACGTTTCGCCCCTATTACAGAC[T]CACAACCACATGATGACCG
F2
solid309_20110930_FRAG_BC_bcSample12273_1541_657_F3 0 NC_012759 12121 42 35M * 0 0 ACACAACCACATGATGACCGAATATATAGTGGCTC BBBBBBA@BBBAB@?B@BBBB<5BBBAA@:>>&B7
solid309_20110930_FRAG_BC_bcSample12295_323_1714_F3 0 NC_012759 12121 42 35M * 0 0 ACACAACCACATGATGACCGAATATATAGTGGAGA BB@@A@@A@@@?@<=?@@=><6*7=?9993>4&7,
solid309_20110930_FRAG_BC_bcSample12325_1148_609_F3 0 NC_012759 12121 42 35M * 0 0 ACACAACCACATGATGACCGAATATATAGTGGAGA BBBB@B@?@B@@A@??BBBA@<.<==:6:1>9(<-
solid309_20110930_FRAG_BC_bcSample11796_1531_1170_F3 0 NC_012759 12122 42 35M * 0 0 CACAACCACATGATGACCGAATATATAGTGGAGCA '&&+&&)&')&0(.,',(.3+&&&+,&&&&&&&&&
solid309_20110930_FRAG_BC_bcSample12110_1166_1149_F3 0 NC_012759 12122 42 35M * 0 0 CACAACCACATGATGACCGAATATATAGTGGAGAC -(:18)538;,9277*'8:<)&,0-+)//3&'1+'
solid309_20110930_FRAG_BC_bcSample183_686_962_F3 0 NC_012759 12123 42 35M * 0 0 ACAACCACATGATGACCGAATATATAGTGGAGTGC BB?BBBB;BBBBBB@ABB;@7AA@@A@*>?+B8@9
I am doing this with the following script
for line in `awk '{if ($1~"-") print ($2-34); else print $2}' $1`
do
    awk -v l=$line '{if ($1~"-") l=l+34; if ($2==l) print}' $1 >> f2
    awk -v l=$line '{if ($4==l) print $10}' URA2.sam | sort | uniq -c | awk '{if ($1>15) print}' >> f2
done
This requires multiple passes with awk, one for each line in F1. I was thinking I could use an associative array made from F1 to do this in one pass. F2 is sorted by $4. I used the following script to try to get the output I wanted:
awk 'FNR==NR{a[$2]=$0;next}$4 in a{print $10}' f1 f2 | sort | uniq -c
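This gets me counts, but they come out as one merged list: the matching F1 line is never printed as a header, and $10 values from different positions are counted together instead of being grouped per F1 line.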
Upvotes: 2
Views: 1509
Reputation: 203645
Here's one way to do it, using GNU awk for asorti() and delete array:
$ cat tst.awk
# Print the saved F1 line for the position just finished, then the sorted
# counts of its $10 values, and clear the counts for the next group.
function prtCounts(   val, sorted, idx) {
    if (prev in f1) {
        print f1[prev]
        asorti(count, sorted)
        for (idx=1; idx in sorted; idx++) {
            val = sorted[idx]
            print count[val], val
        }
    }
    delete count
}

# First file (F1): index each line by its second field.
NR==FNR { f1[$2] = $0; next }

# Second file (F2): whenever $4 changes, flush the previous group.
{
    if ( (FNR>1) && ($4!=prev) )
        prtCounts()
    count[$10]++
    prev = $4
}

END { prtCounts() }
$ gawk -f tst.awk file1 file2
a 1 b c d
1 BALH
2 BLAH
b 2 b c d
1 HAHA
2 ZAHA
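If you also want to drop low-frequency sequences, the way the uniq -c | awk '{if ($1>15) print}' step in the question's script does, a minimal variation (untested sketch, threshold taken from the question) is to guard the print inside prtCounts(); the rest of tst.awk stays the same:
function prtCounts(   val, sorted, idx) {
    if (prev in f1) {
        print f1[prev]
        asorti(count, sorted)
        for (idx=1; idx in sorted; idx++) {
            val = sorted[idx]
            if (count[val] > 15)    # only report sequences seen more than 15 times
                print count[val], val
        }
    }
    delete count
}
(With the small sample files above no count clears the threshold, so only the header lines would print; the cutoff only makes sense on the real data.)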
Upvotes: 4
Reputation: 36262
I would process F1 first and save it in an associative array, using the second field as the key and the whole line as the value. Then you only have to process F2 once, and each time the fourth field changes, print the number of repeated values in the tenth field, like:
awk '
    ## Process F1.
    FNR == NR {
        f1[$2] = $0
        next
    }

    ## Process F2.
    {
        if (FNR == 1 || prev_key == $4) {
            f2[$10]++
        }
        else {
            printf "%s\n", f1[prev_key]
            for (key in f2) {
                printf "%d %s\n", f2[key], key
            }
            delete f2
            f2[$10]++
        }
        prev_key = $4
    }

    END {
        printf "%s\n", f1[prev_key]
        for (key in f2) {
            printf "%d %s\n", f2[key], key
        }
    }
' F1 F2
It yields:
a 1 b c d
1 BALH
2 BLAH
b 2 b c d
2 ZAHA
1 HAHA
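Note that for (key in f2) visits the keys in an unspecified order, which is why ZAHA comes out before HAHA here. If you want each group sorted like the asorti() answer above, one option (assuming GNU awk 4.0 or later) is to set the iteration order at the top of the script:
BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }    # gawk only: make for (key in f2) visit keys in ascending string order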
Upvotes: 1