lonestar21

Reputation: 1193

Speed up my awk command? Answer must be awk :)

I have some awk code that is running really slowly. My file is tab-delimited, 5-column ASCII. I am operating on column 5 to get a count of the relevant characters, which I then use to alter the value in column 4.

Example input line:

10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a

If I find any "^" in $5, I don't want to count it or the character that follows it. Then I want to find out how many characters are ">", "<", or "*" and remove them from the count. I'm guessing that one gsub plus 3 splits is less than ideal, especially since column 5 can occasionally be a very long string.

awk '{l=$4; if($5~/>/ || $5~/</ || $5~/*/ )  {gsub(/\^./,"");l-=split($5,a,"<")-1;l-=split($5,a,">")-1;l-=split($5,a,"*")-1}

If the code runs successfully on the line above, l will be 27.

I am omitting the surrounding parts of the command to try and focus on the part I have a question about.

So, what is the best way to make this run faster?

Upvotes: 0

Views: 1528

Answers (4)

potong

Reputation: 58473

This might work for you:

echo "10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a" |
awk '/[><*^]/{t=$5;gsub(/[><*]|[\^]./,"",t);$4=length(t)}1' 
10 5134832 N 27 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a

If you want to show the amended fifth field instead:

 awk '/[><*^]/{gsub(/[><*]|[\^]./,"",$5);$4=length($5)}1'
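Since the real file is tab delimited, the same idea can be sketched with FS and OFS set to a tab, so that reassigning $4 does not turn the tabs into spaces. The function name recount is only a label for this example, and the sketch assumes every line has exactly five tab-separated fields. It also tests $5 rather than the whole line, which is a small change from the one-liners above:

```shell
# recount: hypothetical wrapper name; reads tab-delimited lines on stdin
recount() {
    awk 'BEGIN { FS = OFS = "\t" }
         $5 ~ /[><*^]/ {
             t = $5                      # work on a copy so $5 is printed unchanged
             gsub(/\^.|[><*]/, "", t)    # drop each ^x pair and any > < *
             $4 = length(t)              # what is left is the new count
         }
         1'
}

printf '10\t5134832\tN\t28\tAaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a\n' | recount
```

For the sample line this should print the same five fields with 27 in column 4, still separated by tabs.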

Upvotes: 0

glenn jackman

Reputation: 247012

Here's a guess:

awk '
    BEGIN {FS = OFS = "\t"}
    {
        str = $5
        gsub(/\^.|[><*]/, "", str)
        l = length(str)
    }
'
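If column 4 has to remain the starting value, as in the question, gsub's return value (the number of substitutions it made) can stand in for the three split calls. A minimal sketch, assuming the ^x pairs are already excluded from column 4, as the example line suggests; adjusted_count is a made-up name:

```shell
# adjusted_count: hypothetical name; prints the adjusted count for each input line
adjusted_count() {
    awk 'BEGIN { FS = "\t" }
         {
             s = $5
             gsub(/\^./, "", s)               # discard ^ plus the character after it
             l = $4 - gsub(/[<>*]/, "", s)    # gsub returns how many it removed
             print l
         }'
}

printf '10\t5134832\tN\t28\tAaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a\n' | adjusted_count
```

Removing the ^x pairs first keeps a ">", "<", or "*" that directly follows a "^" from being subtracted. For the sample line this prints 27.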

Upvotes: 1

Adam Liss

Reputation: 48310

Do you need to use awk, or will this work instead?

cut -f 5 < "$file" | grep -v '^[A-Z]' | tr -d '<>*\n' | wc -c

Translation:

  • Extract the 5th field from the tab-delimited $file.
  • Remove all fields starting with a capital letter.
  • Remove the characters <, >, *, and newlines.
  • Count the remaining characters.

Upvotes: 1

Zsolt Botykai

Reputation: 51643

As far as I can see, your gsub pattern will not work, as the / was not closed. Anyway, if I understand correctly and you want the character count of $5 without certain characters, I'd go with:

count=length(gensub("[><A-Z^]","","g",$5))

List the characters to skip between [ and ], and do not put ^ first, because a leading ^ negates the set. Note that gensub is a GNU awk (gawk) extension.
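To see why the position of ^ inside the brackets matters, here is a small sketch using plain gsub, which should behave the same in any POSIX awk. The sample string a^<b and the name bracket_demo are invented for the demonstration:

```shell
# bracket_demo: hypothetical name; compares ^ at the end and at the start of a set
bracket_demo() {
    awk 'BEGIN {
        s = "a^<b"
        t = s; gsub(/[><^]/, "", t)   # ^ last: a literal caret in the set
        u = s; gsub(/[^><]/, "", u)   # ^ first: negation, removes all but > and <
        print t
        print u
    }'
}

bracket_demo
```

The first line printed is ab (the caret and the < were removed); the second is < (everything else was removed).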

Upvotes: 1
