Reputation: 1193
I have some awk code that is running really slow. The format of my file is tab delimited 5 column ASCII. I am operating on column 5 to get a count of appropriate characters to alter the value in column 4.
Example input line:
10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
If I find any "^" in $5 I want to not count it, or the following character. Then I want to find out how many characters are ">" or "<" or "*" and remove them from the count. I'm guessing using a gsub, and 3 splits is less than ideal, especially since column 5 can occasionally be a very very long string.
awk '{l=$4; if($5~/>/ || $5~/</ || $5~/*/ ) {gsub(/\^./,"");l-=split($5,a,"<")-1;l-=split($5,a,">")-1;l-=split($5,a,"*")-1}
If the code runs successfully on the line above, l will be 27.
I am omitting the surrounding parts of the command to try and focus on the part I have a question about.
So, what is the best step to make this run faster?
Upvotes: 0
Views: 1528
Reputation: 58473
This might work for you:
echo "10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a" |
awk '/[><*^]/{t=$5;gsub(/[><*]|[\^]./,"",t);$4=length(t)}1'
10 5134832 N 27 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
if you want to show the amended fifth field:
awk '/[><*^]/{gsub(/[><*]|[\^]./,"",$5);$4=length($5)}1'
Upvotes: 0
Reputation: 247012
Here's a guess:
awk '
BEGIN {FS = OFS = "\t"}
{
str = $5
gsub(/\^.|[><*]/, "", str)
l = length(str)
}
'
Upvotes: 1
Reputation: 48310
Do you need to use awk
, or will this work instead?
cut -f 5 < $file | grep -v '^[A-Z]' | tr -d '<>*\n' | wc -c
Translation:
$file
.<
, >
, *
, and newlines.Upvotes: 1
Reputation: 51643
Well as I see, your gsub
pattern will not work, as the /
was not closed. Anyway, if I get it correctly and you want the character count of $5
without some characters, I'd go with:
count=length(gensub("[><A-Z^]","","g",$5))
You should list your skippable characters between [
and ]
, and do not start with ^
!
Upvotes: 1