Reputation: 727
That is my first post. I would like to write a small script to count multiple unique repeats in a line. The text is a DNA sequence enter link description here, so the text will be combinations of four letters: A, T, G and C. If one string appears two times, it will be counted twice, and so on.
The unique strings I want to look for are repeats of three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to count repeats of four or more.
Strings to count:
AGAGAG
GAGAGA
CTCTCT
TCTCTC
Example input file (two columns separated by a tab):
Sequence_1 AGAGAG
Sequence_2 AGAGAGT
Sequence_3 AGAGAGAG
Sequence_4 AGAGAT
Sequence_5 AGAGAGAGAGAGAGAGAGT
Sequence_6 AGAGAGTAGAGAG
Sequence_7 CTCTCTCTCTC
Sequence_8 TAGAGAGAT
Sequence_9 TAAGAGAGAAG
Desired output:
Sequence_1 AGAGAG 1
Sequence_2 AGAGAGT 1
Sequence_3 AGAGAGAG 0
Sequence_4 AGAGAT 0
Sequence_5 AGAGAGAGAGAGAGAGAG 0
Sequence_6 AGAGAGTAGAGAG 2
Sequence_7 CTCTCTCTCTCAAGAGAG 1
Sequence_8 TAGAGAGAT 1
Sequence_9 TAAGAGAGAAG 1
I have a small one-liner written with awk, but I think it is not specific when matching the strings:
awk '{if($1 ~ /AGAGAG/)x++; if($1 ~ /TCTCTC/)x++;if($1 ~ /GAGAGA/)x++;if($1 ~ /CTCTCT/)x++;print x;x=0}' inputfile.tab
Thanks so much for your help. All the best, Bernardo
Upvotes: 1
Views: 513
Reputation: 124646
I think there are some inconsistencies in your description and in the sample input and outputs. So this script might not be perfect, but I hope it comes close enough that you can figure out the rest:
#!/usr/bin/perl -n
my ($seq, $dna) = split(/\s+/);
my @strings = qw/AG GA CT TC/;
my $count = 0;
foreach my $s (@strings) {
my ($b, $e) = split(//, $s);
@matches = $dna =~ m/(?<!$e)($s){3}(?!$b)/g;
$count += scalar(@matches);
}
print join("\t", $seq, sprintf("%-20s", $dna), $count), "\n";
You can use it with:
./script.pl < sample.txt
For input:
Sequence_1 AGAGAG
Sequence_2 AGAGAGT
Sequence_3 AGAGAGAG
Sequence_4 AGAGAT
Sequence_5 AGAGAGAGAGAGAGAGAGT
Sequence_6 AGAGAGTAGAGAG
Sequence_7 CTCTCTCTCTCAAGAGAG
It gives:
Sequence_1 AGAGAG 1
Sequence_2 AGAGAGT 1
Sequence_3 AGAGAGAG 0
Sequence_4 AGAGAT 0
Sequence_5 AGAGAGAGAGAGAGAGAGT 0
Sequence_6 AGAGAGTAGAGAG 2
Sequence_7 CTCTCTCTCTCAAGAGAG 1
How it works:
-n
flag in the shebang, the script is executed for each line coming from stdin
@strings
is the list of strings we are interested in@strings
, we count the matches
$s
takes on the value of AG
, GA
, CT
, TC
(?<!$s)($s){3}(?!$s)
matches 3 consecutive $s
that is not followed by $s
and not preceded by $s
(?<!$e)($s){3}(?!$b)
matches 3 consecutive $s
that is not followed by the 1st character of $s
and not preceded by the 2nd character of $s
$x =~ m///g
returns an array of all matchesscalar(@matches)
is the size of the array of all matches, we add it to the countUpvotes: 1