biotech
biotech

Reputation: 727

Count multiple unique strings in a line

That is my first post. I would like to write a small script to count multiple unique repeats in a line. The text is a DNA sequence enter link description here, so the text will be combinations of four letters: A, T, G and C. If one string appears two times, it will be counted twice, and so on.

The unique strings I want to look for are repeats of three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to count repeats of four or more.

Strings to count:

AGAGAG
GAGAGA
CTCTCT
TCTCTC

Example input file (two columns separated by a tab):

Sequence_1    AGAGAG                   
Sequence_2    AGAGAGT                  
Sequence_3    AGAGAGAG                 
Sequence_4    AGAGAT                   
Sequence_5    AGAGAGAGAGAGAGAGAGT      
Sequence_6    AGAGAGTAGAGAG 
Sequence_7    CTCTCTCTCTC  
Sequence_8    TAGAGAGAT                
Sequence_9    TAAGAGAGAAG              

Desired output:

Sequence_1    AGAGAG                   1
Sequence_2    AGAGAGT                  1
Sequence_3    AGAGAGAG                 0
Sequence_4    AGAGAT                   0
Sequence_5    AGAGAGAGAGAGAGAGAG       0
Sequence_6    AGAGAGTAGAGAG            2
Sequence_7    CTCTCTCTCTCAAGAGAG       1 
Sequence_8    TAGAGAGAT                1
Sequence_9    TAAGAGAGAAG              1

I have a small one-liner written with awk, but I think it is not specific when matching the strings:

awk '{if($1 ~ /AGAGAG/)x++; if($1 ~ /TCTCTC/)x++;if($1 ~ /GAGAGA/)x++;if($1 ~ /CTCTCT/)x++;print x;x=0}' inputfile.tab

Thanks so much for your help. All the best, Bernardo

Upvotes: 1

Views: 513

Answers (1)

janos
janos

Reputation: 124646

I think there are some inconsistencies in your description and in the sample input and outputs. So this script might not be perfect, but I hope it comes close enough that you can figure out the rest:

#!/usr/bin/perl -n

my ($seq, $dna) = split(/\s+/);
my @strings = qw/AG GA CT TC/;
my $count = 0;
foreach my $s (@strings) {
    my ($b, $e) = split(//, $s);
    @matches = $dna =~ m/(?<!$e)($s){3}(?!$b)/g;
    $count += scalar(@matches);
}
print join("\t", $seq, sprintf("%-20s", $dna), $count), "\n";

You can use it with:

./script.pl < sample.txt

For input:

Sequence_1    AGAGAG
Sequence_2    AGAGAGT
Sequence_3    AGAGAGAG
Sequence_4    AGAGAT
Sequence_5    AGAGAGAGAGAGAGAGAGT
Sequence_6    AGAGAGTAGAGAG
Sequence_7    CTCTCTCTCTCAAGAGAG

It gives:

Sequence_1    AGAGAG                1
Sequence_2    AGAGAGT               1
Sequence_3    AGAGAGAG              0
Sequence_4    AGAGAT                0
Sequence_5    AGAGAGAGAGAGAGAGAGT   0
Sequence_6    AGAGAGTAGAGAG         2
Sequence_7    CTCTCTCTCTCAAGAGAG    1

How it works:

  • Thanks to the -n flag in the shebang, the script is executed for each line coming from stdin
  • @strings is the list of strings we are interested in
  • For each item in @strings, we count the matches
    • $s takes on the value of AG, GA, CT, TC
    • The expression (?<!$s)($s){3}(?!$s) matches 3 consecutive $s that is not followed by $s and not preceded by $s
    • The expression (?<!$e)($s){3}(?!$b) matches 3 consecutive $s that is not followed by the 1st character of $s and not preceded by the 2nd character of $s
    • The operation $x =~ m///g returns an array of all matches
    • scalar(@matches) is the size of the array of all matches, we add it to the count

Upvotes: 1

Related Questions