Reputation: 311
I want to count the number of occurrences of nucleotides (the letters 'A, T, G, C' in a string). I was trying to use the tr///
operator for this, but it returns a count of zero every time in the below code.
This only happens if I use a variable inside the tr///
operator. If I type the individual letters separately it works. I wanted to know if we could use variables inside tr///
operator for pattern matching (and counting). And if we can, someone tell me how to modify my code.
Later I plan to count the number of codons (~64). Hence the trouble. Appreciate your time. Thanks!
#!/usr/bin/perl
use strict;
use warnings;
my $orf = "ATGCTAGCTAGCATAGAGCTAGCTA";
my @atgc = qw(A T G C);
my %hash = ();
foreach my $nt(@atgc) {
$hash{$nt} = ($orf =~ tr/$nt//);
}
Upvotes: 1
Views: 1847
Reputation: 139631
Make a pattern compiler.
sub make_counter {
my @sequences = @_;
my $pattern = "(?:" . join("|", map quotemeta, @sequences) . ")";
my $compiled = eval q<
sub {
local($_) = @_;
my $n = () = /$pattern/g;
}
>;
if (defined $compiled) {
return $compiled;
}
else {
die "$0: internal: counter compilation failed:\n$@\n";
}
}
With quotemeta
, we force all characters in the sequence to match themselves only with no special meanings. Section 4 of the Perl FAQ describes that funky bit for counting matches:
Another version uses a global match in list context, then assigns the result to a scalar, producing a count of the number of matches.
$count = () = $string =~ /-\d+/g;
Note that it tolerates junk in your sequence. For example, count nucleotides with
my @nucleotides = qw/ G A T C /;
my $numnuc = make_counter @nucleotides;
print $numnuc->("xGxAxTxxxxTyA1C2A"), "\n";
Output:
7
Count codons with
my @codons = qw(
TTT TCT TAT TGT TTC TCC TAC TGC TTA TCA TAA TGA
TTG TCG TAG TGG CTT CCT CAT CGT CTC CCC CAC CGC
CTA CCA CAA CGA CTG CCG CAG CGG ATT ACT AAT AGT
ATC ACC AAC AGC ATA ACA AAA AGA ATG ACG AAG AGG
GTT GCT GAT GGT GTC GCC GAC GGC GTA GCA GAA GGA
GTG GCG GAG GGG
);
my $numcod = make_counter @codons;
print $numcod->("GAG-GGG!AGG,TAT#TTT");
Note that any junk, if present, must occur between codon sequences.
Output:
5
Upvotes: 1
Reputation: 126742
Unfortunately Perl won't interpolate a variable into the search list for tr///
. You will have to use a regular expression instead:
use strict;
use warnings;
my $orf = "ATGCTAGCTAGCATAGAGCTAGCTA";
my @atgc = qw(A T G C);
my %count;
$count{$1}++ while $orf =~ /([@atgc])/g;
printf "%s => %d\n", $_, $count{$_} for @atgc;
output
A => 8
T => 6
G => 6
C => 5
Beware
This isn't - as it may appear - a general solution for matching any one of an array of strings. What is happening is @atcg
is being interpolated into the regex as if it was in a double-quoted string. That means Perl will use the $"
built-in variable (set to a single space by default) to turn it into a string equivalent to join $", @atgc
.
So the code actually looks like
$count{$1}++ while $orf =~ /([A T G C])/g;
which will match spaces as well as the letters, and may break altogether if @atgc
contains anything that is special inside a regex character class like ^
, ]
or -
.
Counting spaces unnecessarily shouldn't be a problem, but if your list may contain symbols then this isn't a solution you should be using.
A count for every ASCII character can be written safely as
$count{$1}++ while $orf =~ /(.)/sg;
and the unwanted information in the %count
hash can simply be ignored.
Upvotes: 4
Reputation: 11566
You don't need regular expressions etc. to do this, you just need to walk a string:
my $orf = "ATGCTAGCTAGCATAGAGCTAGCTA";
my %nt;
$nt{$_}++ foreach (split('', $orf));
Upvotes: 1
Reputation: 118158
You could use s///g
:
#!/usr/bin/env perl
use strict; use warnings;
my $orf = "ATGCTAGCTAGCATAGAGCTAGCTA";
my @atgc = qw(A T G C);
my %hash = map {$_ => $orf =~ s/$_/$_/g } @atgc;
Output:
--- A: 8 C: 5 G: 6 T: 6
If the source string is going to be really long, you can treat it as holding the contents of a while and read one byte at a time assuming the characters are guaranteed to be ASCII.
#!/usr/bin/env perl
use strict; use warnings;
my $orf = "ATGCTAGCTAGCATAGAGCTAGCTA";
my %atgc = map { $_ => undef } qw(A T G C);
use YAML;
print Dump count(\$orf, \%atgc);
sub count {
my $src_ref = shift;
my $lookup = shift;
my %count;
open my $src, '<', $src_ref or die $!;
{
local $/ = \1;
exists $lookup->{$_} and $count{ $_ } += 1 while <$src>;
}
close $src;
return \%count;
}
Upvotes: 1
Reputation: 1416
$hash{$nt} = eval "\$orf =~ tr/\Q$nt\E//"
should do the job. Maybe this is not the most effective solution though.
Upvotes: 3
Reputation: 386396
There is no instance of "$
", "n
" or "t
" in "ATGCTAGCTAGCATAGAGCTAGCTA
", thus tr
correctly returns zero.
If you want to build a tr///
operator, you'll need to parse and compile it.
my %counts;
for my $nt (@atgc) {
$counts{$nt} = eval "\$orf =~ tr/\Q$nt\E//";
}
But I wouldn't use tr///
.
my %atgc = map { $_ => 1 } @atgc;
my %counts;
++$counts{$nt} for grep $atgc{$_}, split //, $orf;
Upvotes: 4