Reputation: 2866
I am using a column in one file to look up values in another file. The second file is very large and I would like to find all the values in a single pass with awk. I have tried doing this with an associative array, but am stumped how to get the output I want. I want to take F1, use $2 to look up values in F2 (matching its $4), and get the output I show below, which is $0 from F1 as the header, followed by $10 from F2 sorted and counted for each unique string (i.e. piped through sort | uniq -c).
F1
+ID=dnaK.p01 12121 TTGGGCAGTTGAAACCAGACGTTTCGCCCCTATTACAGAC[T]CACAACCACATGATGACCG
F2
solid309_20110930_FRAG_BC_bcSample12273_1541_657_F3 0 NC_012759 12121 42 35M * 0 0 ACACAACCACATGATGACCGAATATATAGTGGCTC BBBBBBA@BBBAB@?B@BBBB<5BBBAA@:>>&B7
solid309_20110930_FRAG_BC_bcSample12295_323_1714_F3 0 NC_012759 12121 42 35M * 0 0 ACACAACCACATGATGACCGAATATATAGTGGAGA BB@@A@@A@@@?@<=?@@=><6*7=?9993>4&7,
solid309_20110930_FRAG_BC_bcSample12325_1148_609_F3 0 NC_012759 12121 42 35M * 0 0 ACACAACCACATGATGACCGAATATATAGTGGAGA BBBB@B@?@B@@A@??BBBA@<.<==:6:1>9(<-
solid309_20110930_FRAG_BC_bcSample11796_1531_1170_F3 0 NC_012759 12122 42 35M * 0 0 CACAACCACATGATGACCGAATATATAGTGGAGCA '&&+&&)&')&0(.,',(.3+&&&+,&&&&&&&&&
solid309_20110930_FRAG_BC_bcSample12110_1166_1149_F3 0 NC_012759 12122 42 35M * 0 0 CACAACCACATGATGACCGAATATATAGTGGAGAC -(:18)538;,9277*'8:<)&,0-+)//3&'1+'
solid309_20110930_FRAG_BC_bcSample183_686_962_F3 0 NC_012759 12123 42 35M * 0 0 ACAACCACATGATGACCGAATATATAGTGGAGTGC BB?BBBB;BBBBBB@ABB;@7AA@@A@*>?+B8@9
I am doing this with the following script
for line in `awk '{if ($1~"-") print ($2-34); else print $2}' $1`
do
    awk -v l=$line '{if ($1~"-") l=l+34; if ($2==l) print}' $1 >> f2
    awk -v l=$line '{if ($4==l) print $10}' URA2.sam | sort | uniq -c | awk '{if ($1>15) print}' >> f2
done
This requires multiple passes with awk, one for each line in F1. I was thinking I could use an associative array made from F1 to do this in one pass. F2 is sorted by $4. I used the following script to try to get the output I wanted:
awk 'FNR==NR{a[$2]=$0;next}$4 in a{print $10}' f1 f2 | sort | uniq -c
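This gets me counts, but they come out as one merged list: the matching F1 line is never printed as a header, and $10 values from different positions are counted together instead of being grouped per F1 line.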
Upvotes: 2
Views: 1509
Reputation: 203645
Here's one way to do it, using GNU awk for asorti() and delete array:
$ cat tst.awk
# Print the saved F1 line for the position just finished, then the sorted
# counts of its $10 values, and clear the counts for the next group.
function prtCounts(   val, sorted, idx) {
    if (prev in f1) {
        print f1[prev]
        asorti(count, sorted)
        for (idx=1; idx in sorted; idx++) {
            val = sorted[idx]
            print count[val], val
        }
    }
    delete count
}

# First file (F1): index each line by its second field.
NR==FNR { f1[$2] = $0; next }

# Second file (F2): whenever $4 changes, flush the previous group.
{
    if ( (FNR>1) && ($4!=prev) )
        prtCounts()
    count[$10]++
    prev = $4
}

END { prtCounts() }
$ gawk -f tst.awk file1 file2
a 1 b c d
1 BALH
2 BLAH
b 2 b c d
1 HAHA
2 ZAHA
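If you also want to drop low-frequency sequences, the way the uniq -c | awk '{if ($1>15) print}' step in the question's script does, a minimal variation (untested sketch, threshold taken from the question) is to guard the print inside prtCounts(); the rest of tst.awk stays the same:
function prtCounts(   val, sorted, idx) {
    if (prev in f1) {
        print f1[prev]
        asorti(count, sorted)
        for (idx=1; idx in sorted; idx++) {
            val = sorted[idx]
            if (count[val] > 15)    # only report sequences seen more than 15 times
                print count[val], val
        }
    }
    delete count
}
(With the small sample files above no count clears the threshold, so only the header lines would print; the cutoff only makes sense on the real data.)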
Upvotes: 4
Reputation: 36262
I would process F1 first and save it in an associative array, using the second field as the key and the whole line as the value. Then you only have to process F2 once, and each time the fourth field changes, print the number of repeated values in the tenth field, like:
awk '
    ## Process F1.
    FNR == NR {
        f1[$2] = $0
        next
    }

    ## Process F2.
    {
        if (FNR == 1 || prev_key == $4) {
            f2[$10]++
        }
        else {
            printf "%s\n", f1[prev_key]
            for (key in f2) {
                printf "%d %s\n", f2[key], key
            }
            delete f2
            f2[$10]++
        }
        prev_key = $4
    }

    END {
        printf "%s\n", f1[prev_key]
        for (key in f2) {
            printf "%d %s\n", f2[key], key
        }
    }
' F1 F2
It yields:
a 1 b c d
1 BALH
2 BLAH
b 2 b c d
2 ZAHA
1 HAHA
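Note that for (key in f2) visits the keys in an unspecified order, which is why ZAHA comes out before HAHA here. If you want each group sorted like the asorti() answer above, one option (assuming GNU awk 4.0 or later) is to set the iteration order at the top of the script:
BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }    # gawk only: make for (key in f2) visit keys in ascending string order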
Upvotes: 1