How to extract substrings by knowing the coordinates

Question

I am terribly sorry for bothering you with my problem in several questions, but I need to solve it...

I want to extract several substrings from a file which contains a string, using another file that holds the start and end of each substring I want to extract. The first file looks like this:

>scaffold30     24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................

(the string must be everything in the file except the first line), and the coordinates file is like:

44801988    44802104
44846151    44846312
45620133    45620274
45640443    45640543
45688249    45688358
45729531    45729658
45843362    45843490
46066894    46066996
46176337    46176464
.....................

My script is this:

my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];

#finds  subsequences: fasta files



open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;

while(<INFILE1>) {
    if ($_ !~ m/^>/) {

    local $/ = undef;
    my $var = <INFILE1>;

    open INFILE, $coords_file or die "Could not open $coords_file: $!";
           my @cline = <INFILE>;
    foreach my $cline (@cline) {
    print "$cline\n";
            my @data = split('\t', $cline);
            my $start = $data[0];
            my $end = $data[1];
            my $offset = $end - $start;
           $count++;
           my $sub = substr ($var, $start, $offset);
           print ">conserved $count\n";
           print "$sub\n";

    }
    close INFILE;
    }
}

When I run it, it seems to do only one iteration, and it prints the start of the first file. It looks like the foreach loop doesn't work, and substr doesn't seem to work either. When I add an exit after printing $cline to check the loop, it prints all the lines of the coordinates file.

I am sorry if I am being annoying, but I must finish it and I am a little bit desperate...

Thank you again.

Chris Charley · Accepted Answer

As 'ThisSuitIsBlackNot' suggested, your code could be cleaned up a little. Here is a possible solution that may be what you want.

#!/usr/bin/perl
use strict;
use warnings;

my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];

#finds  subsequences: fasta files

open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;

<INFILE1>; # get rid of the first line - '>scaffold30     24194'

while(<INFILE1>) {
    chomp;
    $fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";

open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;

while(<INFILE>) {
    my ($start, $end) = split;

    # Or, should this be: my $offset = $end - ($start - 1);
    # That would include the start fasta
    my $offset = $end - $start;

    $count++;
    my $sub = substr ($fasta, $start, $offset);
    print ">conserved $count\n";
    print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";
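
The comment in the script leaves open whether the coordinates are 0-based or 1-based. As a minimal sketch of the 1-based, inclusive reading (an assumption; the question does not say which convention the coordinates file uses), the extraction step could be wrapped in a small helper. The name extract_region and the in-memory sample data are hypothetical, used only for illustration. The helper also returns undef when a range falls outside the sequence, which may matter here: the sample coordinates (around 44.8 million) are far larger than the number in the header line (24194), and if that number is the scaffold length, substr would have nothing to return for them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns the
# subsequence for one coordinate pair, ASSUMING 1-based, inclusive
# coordinates. Returns undef if the range falls outside the sequence.
sub extract_region {
    my ($seq, $start, $end) = @_;
    return if $start < 1 || $end < $start || $end > length $seq;

    # substr is 0-based, so shift the start by one; an inclusive
    # range covers end - start + 1 characters.
    return substr($seq, $start - 1, $end - $start + 1);
}

# Small in-memory stand-ins for the FASTA sequence and coordinates file.
my $fasta  = "ACGTACGTAC";
my @coords = ("2\t4", "5\t10", "8\t20");

my $count = 0;
for my $line (@coords) {
    my ($start, $end) = split ' ', $line;
    $count++;

    my $sub = extract_region($fasta, $start, $end);
    if (defined $sub) {
        print ">conserved $count\n";
        print "$sub\n";
    }
    else {
        # The third pair ends past the 10-character sample sequence.
        warn "Skipping $start-$end: outside sequence of length ",
            length($fasta), "\n";
    }
}
```

With this sample data the first two pairs yield CGT and ACGTAC, and the third is skipped with a warning. If the coordinates turn out to be 0-based with an exclusive end, the plain substr($fasta, $start, $end - $start) from the script above is the matching form.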
