LP_640

Reputation: 579

counting string length before and after a match, line by line in bash or sed

I have a file 'test' of DNA sequences, each with a header or ID like so:

>new
ATCGGC
>two
ACGGCTGGG
>tre
ACAACGGTAGCTACTATACGGTCGTATTTTTT

I would like to print the length of each contiguous string before and after a match to a given string, e.g. CGG

The output would then look like this:

>new
2 1
>two
1 5
>tre 
4 11 11 

Alternatively, the output could just list the lengths before and after each match, one line per sequence:

2 1
1 5 
4 11 11 

My first attempt used sed to print the line after each '>' header, then used grep to find the byte offset of each match of "CGG", which I planned to convert to lengths. It produced the following:

sed -n '/>/ {n;p}' test | grep -aob "CGG" 

2:CGG
8:CGG
21:CGG
35:CGG

Essentially, grep is printing the byte offset of each match counted from the start of the whole input, while I want the offset within each line independently (i.e. resetting after each line).

I suppose I need to use sed for the search as well, as it operates line by line, but I'm not sure how to count the byte offset or the characters in a given string.

Any help would be much appreciated.

Upvotes: 2

Views: 1419

Answers (1)

jas

Reputation: 10865

By using your given string as the field separator in awk, it's as easy as iterating through the fields on each line and printing their lengths. (Lines starting with > are printed as they are.)

This gives the desired output for your sample data, though you'll probably want to check edge cases such as lines that start with CGG, end with CGG, or consist only of CGG.

$ awk -F CGG '/^>/ {print; next} {for (i=1; i<=NF; ++i) {printf "%s%s", length($i), (i==NF)?"\n":" "}}' file.txt
>new
2 1
>two
1 5
>tre
4 11 11
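
The edge cases mentioned above can be checked directly. This is only a quick sketch: the function name cgg_lengths and the four test lines are made up for illustration, and the awk program is the same one as in the command above. Because the separator itself is never part of a field, a match at the very start or end of a line yields an empty field, which is reported as length 0.

```shell
# Same awk program as above, wrapped in a function (the name is hypothetical).
cgg_lengths() {
  awk -F CGG '/^>/ {print; next} {for (i=1; i<=NF; ++i) {printf "%s%s", length($i), (i==NF)?"\n":" "}}'
}

# Made-up lines: starts with CGG, ends with CGG, is exactly CGG, contains no CGG.
printf '%s\n' CGGAT ATCGG CGG ATAT | cgg_lengths
# prints:
# 0 2
# 2 0
# 0 0
# 4
```

A line with no match at all is a single field, so only its total length is printed.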

awk -F CGG

Invoke awk using "CGG" as the field separator. This parses each line into a set of fields separated by each (if any) occurrence of the string "CGG". The "CGG" strings themselves are not part of any field.

Thus the line ACAACGGTAGCTACTATACGGTCGTATTTTTT is parsed into the three fields: ACAA, TAGCTACTATA, and TCGTATTTTTT, denoted in the awk program by $1, $2, and $3, respectively.
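
To see this splitting for yourself, the following sketch prints the field count and each field for that line. The function name show_fields is hypothetical and exists only for this illustration.

```shell
# Print NF and every field of each input line, using CGG as the separator.
# (show_fields is a made-up helper name.)
show_fields() {
  awk -F CGG '{print "NF=" NF; for (i=1; i<=NF; ++i) print "$" i " = " $i}'
}

echo ACAACGGTAGCTACTATACGGTCGTATTTTTT | show_fields
# prints:
# NF=3
# $1 = ACAA
# $2 = TAGCTACTATA
# $3 = TCGTATTTTTT
```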

/^>/ {print; next}

This pattern/action tells awk that, if the line starts with >, it should print the line and move immediately to the next line of input, without considering any further patterns or actions in the awk program.

{for (i=1; i<=NF; ++i) {printf "%s%s", length($i), (i==NF)?"\n":" "}}

If we reach this action, we know the line did not start with > (see above). Since there is only an action and no pattern, the action is executed for every line of input that gets this far.

The for loop iterates through all the fields (NF is a special awk variable that contains the number of fields in the current line) and prints their length. By checking if we've arrived at the last field, we know whether to print a newline or just a space.
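
If you want to search for other strings as well, the same program can be laid out over several lines and given the search string as an argument. This is a sketch rather than a tested tool: the function name motif_lengths and its argument order are assumptions made for this example. Note also that awk interprets a field separator longer than one character as a regular expression, so a search string containing regex metacharacters would need escaping; plain DNA letters are fine.

```shell
# motif_lengths MOTIF [FILE...]
# Hypothetical wrapper around the awk program explained above.
motif_lengths() {
  motif=$1
  shift
  awk -v FS="$motif" '
    /^>/ { print; next }            # header lines are printed unchanged
    {
      for (i = 1; i <= NF; ++i)     # one field per stretch between matches
        printf "%s%s", length($i), (i == NF) ? "\n" : " "
    }
  ' "$@"
}
```

With the file from the question, motif_lengths CGG test should give the same output as the one-liner above.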

Upvotes: 7
