Reputation: 3
I have to count the number of overlapping dimers (AA, AG, AC, AT, GA, GG, GC, GT, CC, CG, CA, CT, TT, TA, TG, TC) in multiple sequences using Perl. I wrote the following code but it only works for one sequence. How can I extend it to multiple sequences?
#!/usr/bin/perl -w
open FH, "sample.txt";
$genome=<FH>;
%count=();
$count{substr($genome, $_, 2)}++ for 0..length($genome)-2;
print "AA: $count{AA}\n";
print "AG: $count{AG}\n";
print "AC: $count{AC}\n";
print "AT: $count{AT}\n";
print "TA: $count{TA}\n";
print "TG: $count{TG}\n";
print "TC: $count{TC}\n";
print "TT: $count{TT}\n";
print "GA: $count{GA}\n";
print "GG: $count{GG}\n";
print "GC: $count{GC}\n";
print "GT: $count{GT}\n";
print "CA: $count{CA}\n";
print "CG: $count{CG}\n";
print "CC: $count{CC}\n";
print "CT: $count{CT}\n";
I need:
input example: sample.txt
ATGGGCTCCTCCGCCATCACCGTGAGCTTCCTCCTCTTTCTGGCATTTCAGCTCCCAGGGCAAACAGGAGCAAATCCCGTGTATGGCTCTGTGTCCAATGCAGACCTGATGGATTTCAAGTAAAAG
ATGGTGAGAAAATGGGCCCTGCTCCTGCCCATGCTGCTCTGCGGCCTGACTGGTCCCGCACACCTCTTCCAGCCAAGCCTGGTGCTGGAGATGGCCCAGGTCCTCTTGGATAACTACTGCTTCCCAGAGAACCTGATGGGGATGCAGGGAGCCATCGAGCAGGCCATCAAAAGTCAGGAGATTCTGTCTATCTCAGACCCTCAGACTCTGGCCCATGTGTTGACAGCTGGGGTGCAGAGCTCCTTGAATGACCCTCGCCTGGTCATCTCCTATGAGCCCAGCACCCTCGAGGCCCCTCCGCGAGCTCCAGCAGTCACGAACCTCACACTAGAGGAAATCATCGCAGGGCTGCAGGATGGCCTTCGCCATGAGATTCTGGAAGGCAATGTGGGCTACCTGCGGGTGGACGACATCCCGGGCCAGGAGGTGATGAGCAAGCTGAGGAGCTTCCTGGTGGCCAACGTCTGGAGGAAGCTCGTGAACACGTCCGCCTTGGTGCTGGACCTCCGGCACTGCACTGGGGGACACGTGTCTGGCATCCCCTATGTCATCTCCTACCTGCACCCAGGGAGCACAGTCTCGCACGTGGACACCGTCTACGACCGCCCCTCCAACACAACCACTGAGATCTGGACCCTGCCTGAAGCCCTGGGAGAGAAGTACAGTGCAGACAAGGATGTGGTGGTCCTCACCAGCAGCCGCACGGGGGGCGTGGCTGAGGACATCGCTTACATCCTCAAACAGATGCGCAGGGCCATCGTGGTGGGCGAGCGGACTGTTGGGGGGGCTCTGAACCTCCAGAAGCTGAGGGTAGGCCAGTCCGACTTCTTTCTCACTGTGCCTGTGTCCAGATCCCTGGGGCCCCTGGGTGAGGGCAGCCAGACGTGGGAGGGCAGTGGGGTGCTGCCCTGTGTGGGGACACCGGCCGAGCAGGCCCTGGAGAAAGCCCTGGCCGTTCTCATGCTGCGCAGGGCCCTGCCAGGAGTCATTCAGCGCCTTCAGGAGGCGCTGCGCGAGTACTACACGCTGGTGGACCGTGTGCCCGCCCTGCTGAGCCACCTGGCCGCCATGGACCTGTCCTCGGTGGTCTCCGAGGACGATCTGGTCACTAAGCTCAATGCTGGCCTGCAGGCTGTGTCTGAGGACCCCAGGCTCCAGGTGCAGGTGGTCAGACCCAAAGAAGCCTCTTCTGGGCCTGAGGAAGAAGCTGAAGAACCTCCAGAGGCGGTCCCGGAAGTGCCCGAGGACGAGGCTGTTCGGCGGGCTCTGGTGGACTCCGTGTTCCAGGTTTCTGTGCTGCCGGGCAACGTGGGCTACCTGCGCTTCGACAGTTTCGCTGATGCCTCTGTCCTGGAGGTGCTGGGCCCCTACATCCTGCACCAGGTGTGGGAGCCCCTGCAGGACACGGAGCACCTCATCATGGACCTGCGGCAGAACCCCGGGGGGCCGTCCTCCGCGGTGCCCCTGCTGCTCTCCTACTTCCAGAGCCCTGACGCCAGCCCCGTGCGCCTCTTCTCCACCTACGACCGGCGCACCAACATCACACGCGAGCACTTCAGCCAGACGGAGCTGCTGGGCCGGCCCTACGGCACCCAGCGTGGCGTGTACCTGCTCACTAGCCACCGCACCGCCACCGCGGCCGAGGAGCTGGCCTTCCTCATGCAGTCACTGGGCTGGGCCACGCTGGTGGGCGAGATCACCGCGGGCAGCCTGCTGCACACACACACAGTATCCCTGCTGGAGACGCCCGAGGGCGGCCTGGCGCTCACGGTGCCTGTGCTCACCTTCATCGACAACCATGGCGAGTGCTGGCTGGGGGGCGGTGTGGTCCCCGATGCCATTGTGCTGGCCGAGGAAGCCCTAGACAGAGCCCAGGAGGTGCTGGAGTTCCACCGAAGCTTGGGGGAGTTGGTGGAAGGCACGGGGCGCCTGCTGGAGGCCCACTACGCTCGGCCAGAGGTCGTGGGGCAGATGGGTGCCCTGCTGCGAGCCAAGCTGGCCCAGGGGGCCTACCGCACCGCGGTGGACCTGGAGTCGCTGGCTTCCCAGCTTACGGCCGACCTGCAGGAGATGTCTGGGGACCACCGTCTGCTGGTGTTCCACAGCCCCGGCGAAATGGTGGCTGAGGAGGCGCCCCCACCGCCTCCCGTCGTCCCCTCCCCGGAGGAGCTGTCCTATCTCATCGAGGCCCTGTTCAAGACTGAGGTGCTGCCCGGCCAGCTGGGCTACCTGCGTTTCGACGCCATGGCTGAGCTGGAGACGGTGAAGGCCGTCGGGCCACAGCTGGTGCAGCTGGTGTGGCAGAAGCTGGTGGACACGGCCGCGCTGGTGGTCGACCTGCGCTACAACCCCGGCAGCTACTCCACAGCCGTGCCTCTACTCTGCTCCTACTTCTTCGAGGCAGAGCCCCGCCGGCACCTCTACTCTGTCTTTGACAGGGCCACGTCAAGGGTCACAGAGGTATGGACCCTGCCCCACGTTACAGGCCAGCGCTATGGCTCCCACAAGGACCTCTACGTTCTGGTGAGCCACACCAGCGGTTCAGCAGCTGAGGCTTTTGCTCACACCATGCAGGATCTGCAGCGAGCCACCATCATCGGGGAGCCCACGGCCGGAGGGGCACTCTCCGTGGGAATCTACCAGGTGGGCAGCAGCGCCTTATACGCCTCCATGCCCACGCAGATGGCCATGAGTGCCAGCACCGGCGAGGCCTGGGATCTGGCTGGGGTGGAGCCGGACATCACTGTGCCCATGAGCGTGGCCCTCTCCACAGCCCGGGACATAGTGACCCTGCGTGCCAAGGTGCCCACTGTGCTGCAGACAGCTGGGAAGCTCGTAGCGGATAACTACGCCTCCCCTGAGCTGGGAGTCAAGATGGCAGCCGAACTGAGTGGTCTGCAGAGCCGCTATGCCAGGGTGACCTCAGAAGCCGCCCTCGCCGAGCTGCTGCAAGCCGACCTGCAGGTGCTGTCCGGGGACCCACACCTGAAGACAGCTCATATACCTGAGGATGCCAAAGACCGCATTCCTGGCATTGTACCCATGCAGTAACAG
ATGGACATGATGGACGGCTGCCAGTTCTCGCCCTCTGAGTACTTCTACGACGGCTCCTGCATCCCATCCCCCGACGGTGAGTTCGGGGACGAGTTTGAGCCGCGAGTGGCTGCTTTCGGGGCTCACAAGGCAGACCTGCAAGGCTCAGACGAGGACGAGCACGTGCGAGCACCCACGGGCCACCACCAGGCCGGCCACTGCCTCATGTGGGCCTGCAAAGCATGCAAAAGGAAGTCCACCACCATGGATCGGCGGAAGGCGGCCACCATGCGCGAGCGGAGACGCCTGAAGAAGGTCAACCAGGCTTTCGACACGCTCAAGCGGTGCACCACGACCAACCCTAACCAGAGGCTGCCCAAGGTGGAGATCCTCAGGAATGCCATCCGCTACATTGAGAGTCTGCAGGAGCTGCTTAGGGAACAGGTGGAAAACTACTATAGCCTGCCGGGGCAGAGCTGCTCTGAGCCCACCAGCCCCACCTCAAGTTGCTCTGATGGCATGTAAATG
Upvotes: 0
Views: 439
Reputation: 132802
My first thought was a global match in scalar context with an adjustment to pos() to back up one. This way, I only have to scan the string once for all dimers (whereas the other answers so far scan the sequence once for every dimer). I maintain too hashes; one for the total and one for the sequence. I'm using the $.
special variable that holds the current input line number to label the sequence:
use Data::Printer;
while( <DATA> ) {
while( /\G([ATGC]{2})/g ) {
$total{$1}++;
$by_sequence{$.}{$1}++;
pos() = pos() - 1
}
}
p( %total );
p( %by_sequence );
Doing the same thing with substr took a bit much more work because I couldn't limit the substring to just the characters that mattered. I had to pay attention to newlines and odd counts at the end:
use Data::Printer;
LINE: while( <DATA> ) {
chomp;
my $pos = 0;
SUB: while( my $sub = substr( $_, $pos, 2 ) ) {
last SUB unless 2 == length $sub;
$total{$sub}++;
$by_sequence{$.}{$sub}++;
$pos++;
}
}
p( %total );
p( %by_sequence );
The output from Data::Printer is very nice. I've been using it in preference to Data::Dumper since I found out about it at YAPC::Brasil. (And, by the way, its author will be at YAPC::NA in Madison this year to talk about it):
{
AA 101,
AC 215,
AG 268,
AT 106,
CA 286,
CC 388,
CG 201,
CT 310,
GA 239,
GC 376,
GG 369,
GT 168,
TA 61,
TC 206,
TG 317,
TT 73
}
{
1 {
AA 9,
AC 3,
AG 8,
AT 8,
CA 11,
CC 12,
CG 3,
CT 11,
GA 5,
GC 9,
GG 8,
GT 6,
TA 2,
TC 13,
TG 10,
TT 7
},
2 {
AA 68,
AC 177,
AG 219,
AT 80,
CA 230,
CC 329,
CG 168,
CT 264,
GA 195,
GC 316,
GG 317,
GT 146,
TA 50,
TC 169,
TG 271,
TT 54
},
3 {
AA 24,
AC 35,
AG 41,
AT 18,
CA 45,
CC 47,
CG 30,
CT 35,
GA 39,
GC 51,
GG 44,
GT 16,
TA 9,
TC 24,
TG 36,
TT 12
}
}
Upvotes: 0
Reputation: 25873
#!/usr/bin/perl
use strict;
my @dimers = qw(AA AG AC AT GA GG GC GT CC CG CA CT TT TA TG TC);
my @dimers_totals;
my $i = 1;
for(<>)
{
my $sequence = $_;
print("Sequence $i:\n");
my $j = 0;
for(@dimers)
{
my $number =()= $sequence =~ /$_/gi;
$dimers_totals[$j++] += $number;
print "\t$_: $number\n"
}
print("\n");
$i++;
}
print("Totals:\n");
$i = 0;
for(@dimers)
{
print("\t$_: $dimers_totals[$i++]\n");
}
Use like
./dimers_count.pl < sample.txt
Upvotes: 1
Reputation: 4311
The key to this is take what you had so far as a functional unit of what you were trying to build. Your code worked for one sequence; extending it to multiple sequences is simply a matter of getting rid of the assumption that only one sequence is being considered:
use strict;
use warnings;
use 5.010;
use Data::Dumper ();
sub count_dimers {
my ($sequence) = @_;
my %counts;
$counts{substr($sequence, $_, 2)}++ for 0..length($sequence) - 2;
my @dimers = qw(AA AG AC AT GA GG GC GT CC CG CA CT TT TA TG TC);
%counts = map { $_ => $counts{$_} } @dimers;
return %counts;
}
open(my $fh, '<', 'sample.txt');
my @counts_by_sequence;
while (my $sequence = <$fh>) {
my %sequence_counts = count_dimers($sequence);
push @counts_by_sequence, \%sequence_counts;
}
my %total_counts;
for my $sequence_counts (@counts_by_sequence) {
for my $dimer (keys %$sequence_counts) {
$total_counts{$dimer} += ${ $sequence_counts}{$dimer};
}
}
say Data::Dumper->Dump(
[\%total_counts, \@counts_by_sequence],
[qw(total_count counts_by_sequence)]
);
I left the output as an exercise to the reader, but the general shape of the transformation should be evident: What was once the entirety of the program is now a function that gets called once per sequence with the results per call stored and the results over all calls totaled.
Upvotes: 0