Reputation: 28339
I have bash and awk script that I use to extract data from the text file.
However it is too slow with large datasets and doesn't work perfectly. I believe that it is possible to write all my bash loop in one awk command and I ask somebody to help me with this.
cat dummy_list
AAA
AAAA
AAAAA
cat dummy_table
13 19 AAA 69 96 "ID-999" 34
23 42 AAA 12 19 "ID-999" 64
53 79 AAA 43 58 "ID-482" 36
13 43 AAA 12 15 "ID-492" 75
23 90 AAA 45 87 "ID-492" 34
12 41 AAAA 76 79 "ID-923" 23
19 58 AAAA 15 87 "ID-923" 75
10 40 AAAA 18 82 "ID-482" 23
11 18 AAAA 18 82 "ID-482" 52
15 19 AAAA 18 82 "ID-482" 62
59 69 AAAA 10 18 "ID-482" 83
78 89 AAAA 32 41 "ID-983" 24
23 53 AAAAA 78 99 "ID-916" 82
What I want from this table:
For every dummy_list item (AAA
or AAAA
or AAAAA
) extract how many different times ID range was mentioned ( by this I mean unique columns 4+5+6 (like 69 96 "ID-999"
)). There are duplicate ID's (like 18 82 "ID-482"
) and I have to discard them.
My script looks like this:
while read a; do
awk -v VAR="$a" '($3==VAR) {print $4"\t"$5"\t"$6}' dummy_table |
sort -u |
cut -f 3 |
sort |
uniq -c |
awk '{print $1}' |
tr '\n' ' ' |
awk -v VAR="$a" '{print VAR"\t"$0}'
done < dummy_list
AAA 1 2 2
AAAA 2 2 1
AAAAA 1
It's the same as AAA "ID-482" mentioned once; "ID-492" mentioned twice; "ID-999" mentioned twice
.
This is the output I want.
For every dummy_list item get average number of how many times it gets mentioned with the same ID. For example AAA
occurs twice with "ID-999"
, one time with "ID-482"
and two times with "ID-492"
- so it's (2+1+2)/3=1.66
My script looks like this:
while read a ; do
ID_TIMES=$(awk -v VAR="$a" '($3==VAR) {print $6}' dummy_table |
sort -u |
wc -l) &&
awk -v VAR="$a" '($3==VAR) {print $6}' dummy_table |
sort |
uniq -c |
awk -v VAR="$ID_TIMES" '{sum+=$1} END {print sum/VAR}'
done < dummy_list
AAA 1.666
AAAA 2.333
AAAAA 1
For every dummy_list item extract ID range and calculate proportion between columns.
For example:
for AAA's ID-999:
RANGE1=sum $5-$4(96-69) + $5-$4(19-12)
RANGE2=sum $7(34+64)
then RANGE2*100/RANGE1=288
For the output like this:
AAA 288 240 242
....
AAAAA 390
I wasn't able to write such script by myself as I got stuck with two variables $RANGE1 and $RANGE2.
If it it possible it would be great to discard duplicate ranges like 18 82 "ID-482"
in this step as well.
I believe that all these there operations can be calculated with only one awk
command and I feel desperate about my scripts. I really hope that someone will help me in this operation.
Upvotes: 0
Views: 377
Reputation: 1125
Only a partial answer but here is a one-liner solution for your first problem:
awk -F' ' '{group[$3]++;ind[$6]++};{count[$3][$6]+=1}; END{for (i in group){for (j in ind) if(count[i][j] > 0) print i, j, count[i][j]}}' dummy_variable.txt
Output:
AAA "ID-482" 1
AAA "ID-999" 2
AAA "ID-492" 2
AAAA "ID-923" 2
AAAA "ID-482" 4
AAAA "ID-983" 1
AAAAA "ID-916" 1
It is then fairly trivil to use this output to calculate the answer to your second question.
Upvotes: 1
Reputation: 5675
You can try this.
file a.awk:
BEGIN {
# read list of items
while ( ( getline < "dummy_list" ) > 0 )
{
items[$1] = 0
}
}
{
# calculate ammountof uniqur ids
key = $3 SUBSEP $6
if ( ! ( key in ids ) && ( $3 in items ) )
{
unique_ids[$3] += 1
}
# calculate ammount of duplication
ids [$3,$6] += 1
# calculate range parameters
range1 [$3,$6] += $5 - $4
range2 [$3,$6] += $7
}
END {
for ( item in items )
{
print "--- item = " item " ---\n"
for ( key in ids )
{
split ( key, s, SUBSEP );
if ( s[1] != item ) continue;
range = range2[key] * 100 / range1[key]
average[item] += float ( ids[key] ) / unique_ids[item];
print "id = " s[2] "\tammount of dup = " ids[key] " range = " int ( range )
}
print "\naverage = " average[item] "\n"
}
}
run:
awk -f a.awk dummy_table
output:
--- item = AAAA ---
id = "ID-983" ammount of dup = 1 range = 266
id = "ID-923" ammount of dup = 2 range = 130
id = "ID-482" ammount of dup = 4 range = 110
average = 2.33333
--- item = AAAAA ---
id = "ID-916" ammount of dup = 1 range = 390
average = 1
--- item = AAA ---
id = "ID-999" ammount of dup = 2 range = 288
id = "ID-482" ammount of dup = 1 range = 240
id = "ID-492" ammount of dup = 2 range = 242
average = 1.66667
There is one moment - I can't understand how you got 225 for "ID-482" and item AAA in question #3.
RANGE2 * 100 / RANGE1 = 36 * 100 / ( 58 - 43 ) = 240.
Are you sure, that your example on question #3 is correct?
Upvotes: 2