Vasilis
Vasilis

Reputation: 173

how to extract substrings by knowing the coordinates

I am terribly sorry for bothering you with my problem in several questions, but I need to solve it...

I want to extract several substrings from a file whick contains string by using another file with the begin and the end of each substring that I want to extract. The first file is like:

>scaffold30     24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................

(the string must be everything in the file except the first line), and the coordinates file is like:

44801988    44802104
44846151    44846312
45620133    45620274
45640443    45640543
45688249    45688358
45729531    45729658
45843362    45843490
46066894    46066996
46176337    46176464
.....................

my script is this:

my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];

#finds  subsequences: fasta files



open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;

while(<INFILE1>) {
    if ($_ !~ m/^>/) {

    local $/ = undef;
    my $var = <INFILE1>;

    open INFILE, $coords_file or die "Could not open $coords_file: $!";
           my @cline = <INFILE>;
    foreach my $cline (@cline) {
    print "$cline\n";
            my@data = split('\t', $cline);
            my $start = $data[0];
            my $end = $data[1];
            my $offset = $end - $start;
           $count++;
           my $sub = substr ($var, $start, $offset);
           print ">conserved $count\n";
           print "$sub\n";

    }
    close INFILE;
    }
}

when I run it, it looks like it does only one iteration and it prints me the start of the first file. It seems like the foreach loop doesn't work. also substr seems that doesn't work. when I put an exit to print the cline to check the loop, it prints all the lines of the file with the coordinates.

I am sorry if I become annoying, but I must finish it and I am a little bit desperate...

Thank you again.

Upvotes: 0

Views: 573

Answers (2)

Chris Charley
Chris Charley

Reputation: 6573

As 'ThisSuitIsBlackNot' suggested, your code could be cleaned up a little. Here is a possible solution that may be what you want.

#!/usr/bin/perl
use strict;
use warnings;

my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];

#finds  subsequences: fasta files

open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;

<INFILE1>; # get rid of the first line - '>scaffold30     24194'

while(<INFILE1>) {
    chomp;
    $fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";

open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;

while(<INFILE>) {
    my ($start, $end) = split;

    # Or, should this be: my $offset = $end - ($start - 1);
    # That would include the start fasta
    my $offset = $end - $start;

    $count++;
    my $sub = substr ($fasta, $start, $offset);
    print ">conserved $count\n";
    print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";

Upvotes: 1

ThisSuitIsBlackNot
ThisSuitIsBlackNot

Reputation: 24063

This line

local $/ = undef;

changes $/ for the entire enclosing block, which includes the section where you read in your second file. $/ is the input record separator, which essentially defines what a "line" is (it is a newline by default, see perldoc perlvar for details). When you read from a filehandle using <>, $/ is used to determine where to stop reading. For example, the following program relies on the default line-splitting behavior, and so only reads until the first newline:

my $foo = <DATA>;
say $foo;
# Output:
# 1

__DATA__
1
2
3

Whereas this program reads all the way to EOF:

local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3

__DATA__
1
2
3

This means your @cline array gets only one element, which is a string containing the text of your entire coordinates file. You can see this using Data::Dumper:

use Data::Dumper;

print Dumper(\@cline);

Which in your case will output something like:

$VAR1 = [
          '44801988    44802104
44846151    44846312
45620133    45620274
45640443    45640543
45688249    45688358
45729531    45729658
45843362    45843490
46066894    46066996
46176337    46176464
'
        ];

Notice how your array (technically an arrayref in this case), delineated by [ and ], contains only a single element, which is a string (delineated by single quotes) that contains newlines.

Let's walk through the relevant sections of your code:

while(<INFILE1>) {
    if ($_ !~ m/^>/) {
        # Enable localized slurp mode. Stays in effect until we leave the 'if'
        local $/ = undef;

        # Read the rest of INFILE1 into $var (from current line to EOF)
        my $var = <INFILE1>;

        open INFILE, $coords_file or die "Could not open $coords_file: $!";

        # In list context, return each block until the $/ character as a
        # separate list element. Since $/ is still undef, this will read
        # everything until EOF into our first list element, resulting in
        # a one-element array
        my @cline = <INFILE>;

        # Since @cline only has one element, the loop only has one iteration
        foreach my $cline (@cline) {

As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should probably use lexical filehandles anyway (and the three-argument form of open):

open my $chromosome_fh,  "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;

Also, you do not need to nest your loops in this case, it just makes your code more convoluted. First read the relevant parts of your chromosome file into a variable (named something more meaningful than var):

# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
    next if /^>/;
    $chromosome .= $_;
}

Then read in your coordinates file:

my @cline = <$coordinates_fh>;

Or if you only need to use the contents of the coordinates file once, process each line as you go using a while loop:

while (<$coordinates_fh>) {
    # Do something for each line here
}

Upvotes: 2

Related Questions