cooldood3490

Reputation: 2498

How to split an entire string into an array in Perl

I'm trying to process an entire string but the way my code is written, part of it is not being processed. Here's a representation of my code:

#!/usr/bin/perl
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
              VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
              CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
              LRDVVVGRHPLHLLEDAVTKPELRPCPTP";

$string =~ s/\s+//g;     # remove white space from string
# split the string into fragments of 58 characters and store in array
my @array = $string =~ /[A-Z]{58}/g;   
my $len = scalar @array;
print $len . "\n";    # this prints 3
# print the fragments
print $array[0] . "\n";
print $array[1] . "\n";
print $array[2] . "\n";
print $array[3] . "\n";

The code outputs the following:

3
MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD
PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF
VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL
<blank space> 

Notice that the rest of the string EDAVTKPELRPCPTP is not stored in @array. When I'm creating my array, how do I store EDAVTKPELRPCPTP? Perhaps I could store it in $array[3]?

Upvotes: 0

Views: 113

Answers (4)

Borodin

Reputation: 126762

You may prefer to use unpack, like this:

$string =~ s/\s+//g;    
my @fragments = unpack '(A58)*', $string;

Or, if you would rather leave $string unchanged and have Perl v5.14 or later, you can use the /r (non-destructive) substitution modifier and write:

my @fragments = unpack '(A58)*', $string =~ s/\s+//gr;
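As a rough illustration of the unpack approach, here is a minimal sketch that uses a hypothetical 13-letter input and a chunk size of 5 (both are assumptions chosen only to keep the output short; the real data uses 58). It assumes Perl v5.14 or later for the /r modifier.

```perl
use strict;
use warnings;

# Hypothetical short input and chunk size, for illustration only.
my $string = "ABCDE FGHIJ\nKLM";
my $size   = 5;

# s///r returns the cleaned copy and leaves $string untouched.
# The (A5)* group repeats until the input is exhausted, so the
# final, shorter piece is returned as well.
my @fragments = unpack "(A$size)*", $string =~ s/\s+//gr;

print scalar(@fragments), "\n";       # 3
print join(',', @fragments), "\n";    # ABCDE,FGHIJ,KLM
```

Note that the A format strips trailing spaces from each piece, which does not matter here because all whitespace has already been removed.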

Upvotes: 2

Matt Jacob

Reputation: 6553

If you don't actually need regex character classes, this is how I'd do it:

use strict;
use warnings;
use Data::Dump;

my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
              VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
              CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
              LRDVVVGRHPLHLLEDAVTKPELRPCPTP";

$string =~ s/\s+//g;

my @chunks;

while (length($string)) {
    push(@chunks, substr($string, 0, 58, ''));
}

dd($string, \@chunks);

Output:

(
  "",
  [
    "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD",
    "PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF",
    "VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL",
    "EDAVTKPELRPCPTP",
  ],
)
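The loop above consumes $string as it goes, which is why it ends up empty in the output. If the original needs to survive, one possible variation (a sketch, not part of the answer above) steps an offset through the string instead. The input and the chunk size of 5 are hypothetical stand-ins for the real sequence and 58.

```perl
use strict;
use warnings;

# Hypothetical short input and chunk size, for illustration only.
my $string = "ABCDEFGHIJKLM";
my $size   = 5;

my @chunks;

# Three-argument substr only reads, so $string is left intact.
# When fewer than $size characters remain, substr returns what is left.
for (my $pos = 0; $pos < length $string; $pos += $size) {
    push @chunks, substr($string, $pos, $size);
}

print join(',', @chunks), "\n";    # ABCDE,FGHIJ,KLM
print "$string\n";                 # ABCDEFGHIJKLM
```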

Upvotes: 1

Axeman

Reputation: 29854

What you're missing is the ability to match fewer than 58 characters. Since you only want to do that at the end of the string, you can do this:

/[A-Z]{58}|[A-Z]{1,57}\z/

Which I would prefer to write like this:

/\p{Upper}{58}|\p{Upper}{1,57}\z/

However, since quantifiers are greedy by default, a range quantifier will prefer to gather 58 characters and only settle for fewer when it runs out of matching input, so this simpler pattern is enough:

/\p{Upper}{1,58}/

Or, for the reasons Schwern mentions (such as avoiding non-ASCII uppercase letters, which \p{Upper} would also match):

/[A-Z]{1,58}/
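To see how these patterns differ, here is a small sketch that runs each one with the /g flag in list context. It uses a hypothetical 13-letter input and a chunk size of 5 rather than 58, purely to keep the output readable.

```perl
use strict;
use warnings;

# Hypothetical 13-letter input; chunk size 5 stands in for 58.
my $string = "ABCDEFGHIJKLM";

# Exact count only: the 3-letter tail never matches.
my @exact = $string =~ /[A-Z]{5}/g;

# Full chunk, or a shorter run anchored at the end of the string.
my @either = $string =~ /[A-Z]{5}|[A-Z]{1,4}\z/g;

# Greedy range: takes 5 while it can, then whatever is left.
my @greedy = $string =~ /[A-Z]{1,5}/g;

print "@exact\n";     # ABCDE FGHIJ
print "@either\n";    # ABCDE FGHIJ KLM
print "@greedy\n";    # ABCDE FGHIJ KLM
```

The last two give the same result on clean input. The alternation is stricter: it only accepts a short piece at the very end, whereas the greedy range would also accept short runs wherever a non-matching character interrupts the string.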

Upvotes: 2

Schwern

Reputation: 165546

You've almost got it. You need to change your regex to allow for 1 to 58 characters.

my @array = $string =~ /[A-Z]{1,58}/g;

In addition, you have an error in your script using @prot_seq instead of @array. You should always use strict to protect yourself against this sort of thing. Here's the script with strict, warnings, and 5.10 features (to get say).

#!/usr/bin/perl

use strict;
use warnings;
use v5.10;

my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
              VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
              CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
              LRDVVVGRHPLHLLEDAVTKPELRPCPTP";

# Strip whitespace.
$string =~ s/\s+//g;

# Split the string into fragments of 58 characters or fewer
my @fragments = $string =~ /[A-Z]{1,58}/g;

say "Num fragments: ".scalar @fragments;
say join "\n", @fragments;

Upvotes: 5
