Nico64

Reputation: 173

How to print the length of the following line

I would like to modify a file with awk so that each header line includes the length of the line that follows it. My file looks like this:

>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)  
ATGTCGATGCTCGATC  
>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-) 
ATGCGATGCTAGCTAGCTCGAT  
>AAAS:1215614:1215701:-:1::NW_015494524.1:1215614-1215701(-)
ATGCCGCGACGCAGCACCCGACGCGCAG

I am using awk to modify it to have the following format:

>Assembly_AAAS_1_16  
ATGTCGATGCTCGATC  
>Assembly_AAAS_2_22  
ATGCGATGCTAGCTAGCTCGAT  
>Assembly_AAAS_3_28  
ATGCCGCGACGCAGCACCCGACGCGCAG

I have used awk to modify the first part.

awk -F":" -v i=1 '/>/{print ">Assembly_" $1 "_" val i "_";i++;next} {print length($0)} 1' infile | sed -e "s/_>/_/g" > outfile

I can use print length($0), but how do I print it on the same line as the header?

Thanks

Upvotes: 1

Views: 1241

Answers (2)

RavinderSingh13

Reputation: 133428

EDIT2: Since the OP has changed the sample data again, adding this code now.

awk -v val="Assembly_AAAS_" '/>/{hdr=">" val (++i) "_";next} {sub(/ +$/,"");print hdr length($0) ORS $0}'  Input_file

OR

awk -v val="Assembly_AAAS_" '/>/{hdr=">" val (++i) "_";next} {print hdr length($1) ORS $0}'  Input_file

The header prefix is kept in hdr rather than written back into val, so each header is built fresh from the fixed prefix. The first command strips trailing spaces from the sequence lines of Input_file; if you want to keep them, remove the sub(/ +$/,""); part. The second command measures $1, so trailing spaces are not counted but are still printed.
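As a minimal sketch of the same hold-the-header idea, the gene name can also be taken from each header instead of being hard-coded. The sample lines are copied from the question (trailing spaces included); the variable names name, n and result are only illustrative, and the sketch assumes every header is followed by exactly one sequence line.

```shell
# Feed the question's sample lines (trailing spaces kept on purpose) to awk.
result=$(printf '%s\n' \
  '>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)  ' \
  'ATGTCGATGCTCGATC  ' \
  '>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-) ' \
  'ATGCGATGCTAGCTAGCTCGAT  ' \
  '>AAAS:1215614:1215701:-:1::NW_015494524.1:1215614-1215701(-)' \
  'ATGCCGCGACGCAGCACCCGACGCGCAG' |
awk '
  /^>/ {                        # header line: remember the name, print nothing yet
    split($0, part, ":")        # part[1] is ">AAAS"
    name = substr(part[1], 2)   # drop the leading ">"
    n++                         # running header count
    next
  }
  {                             # sequence line: its length is now known
    sub(/[ \t]+$/, "")          # strip trailing blanks before measuring
    print ">Assembly_" name "_" n "_" length($0)
    print
  }
')

printf '%s\n' "$result"
```

Because the header is only printed once the sequence line has been read, both pieces of information end up on the same output line without a second pass or a sed step.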


EDIT: Changed the solution after the OP updated the question.

awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{value="\047" val i val1;i++;next} {print value length($0) ORS $0}' Input_file

OR

awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '
/>/{  value="\047" val i val1;
      i++;
      next}
{
      print value length($0) ORS $0
}
'   Input_file

The following awk may help you with this.

awk -v i="" -v j=2 '/>/{print "\047>Assembly_GeneName1_"++i"_sizeline"j;j+=2;next} 1'  Input_file

2nd solution:

awk -v i=1 -v j=2 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{print "\047" val i val1 j;j+=2;i++;next} 1'  Input_file

Upvotes: 2

kvantour

Reputation: 26471

What you are dealing with is a beautiful example of records which are not lines. awk is a record parser and by default, a record is defined to be a line. With awk you can define a record to be a block of text using the record separator RS.

RS : The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
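To make the quoted rules concrete, here is a small sketch of the two portable cases: a single-character RS and the null RS ("paragraph mode"). The input strings and the variable names single and paragraph are made up for illustration only.

```shell
# Single-character RS: the input is split into records at every ";".
single=$(printf 'a;b;c' | awk 'BEGIN { RS = ";" } { print NR ":" $0 }')

# Null RS: records are separated by blank lines, and a newline
# always acts as a field separator inside each record.
paragraph=$(printf 'a\nb\n\n\nc\nd\n' | awk 'BEGIN { RS = "" } { print NR ":" $1 "," $2 }')

printf '%s\n' "$single"
printf '%s\n' "$paragraph"
```

The first command prints 1:a, 2:b and 3:c on separate lines; the second prints 1:a,b and 2:c,d.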

So the goal is to define the record to be

AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)  
ATGTCGATGCTCGATC

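This can be done by defining RS="\n>". Furthermore, we will use \n as the field separator FS. This way you get the requested length as length($2) and the count from the record counter NR. Note that a multi-character RS is an extension (supported by gawk and mawk, for example); as the quote above says, POSIX leaves its behaviour unspecified.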

A simple awk script is then:

awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
     {$1=">Assembly_AAAS_"NR"_"length($2)}
     {print $1,$2}' file

This will do exactly what you want.

Note: we use print $1,$2 and not print $0 because the last record might have 3 fields (if the last character of the file is a newline). Printing $0 would then leave an extra empty line at the end of the output.

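If you want to pick the AAAS string out of $1, you can use substr($1,1,match($1,":")-1). The first record still carries the leading > of the file (there is no newline before it for RS to consume), so strip it first. This results in:

awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
     {sub(/^>/,"",$1)}
     {$1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)}
     {print $1,$2}' file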

Finally, be aware that the above solution only gives the right length if there are no spaces in $2. If you need to handle them, you can do this:

awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
     { sub(/^>/,"",$1); gsub(/[[:blank:]]/,"",$2);
       $1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)
     }
     { print $1,$2 }' file
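Since the POSIX text quoted above leaves a multi-character RS unspecified, a portable variant may be useful. The sketch below is an assumption-laden alternative, not the script above: it uses the single character ">" as RS, skips the empty record that precedes the first ">", and keeps its own counter n instead of NR. It assumes ">" never occurs inside a header or a sequence; the names name, seq, n and result are illustrative.

```shell
# Same sample as in the question, trailing spaces included.
result=$(printf '%s\n' \
  '>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)  ' \
  'ATGTCGATGCTCGATC  ' \
  '>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-) ' \
  'ATGCGATGCTAGCTAGCTCGAT  ' \
  '>AAAS:1215614:1215701:-:1::NW_015494524.1:1215614-1215701(-)' \
  'ATGCCGCGACGCAGCACCCGACGCGCAG' |
awk '
  BEGIN { RS = ">"; FS = "\n" }   # one record per ">" block, one field per line
  $1 == "" { next }               # skip the empty record before the first ">"
  {
    name = substr($1, 1, index($1, ":") - 1)  # text before the first ":"
    seq = $2
    gsub(/[ \t]/, "", seq)        # remove blanks so they are not counted
    print ">Assembly_" name "_" (++n) "_" length(seq)
    print seq
  }
')

printf '%s\n' "$result"
```

With a single-character RS every record ends in the newline that preceded the next ">", so each record has a trailing empty field; printing the two wanted values explicitly, rather than $0, avoids the resulting blank lines.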

Upvotes: 1
