Ed Tickle
Ed Tickle

Reputation: 87

Print next line after successful match in FASTA header

I have some successful code that pulls out the range information from a FASTA title if a numerical position (from another file) lies inside it, and prints the regex captures and original position as a result.

File 1 sample data:

7065_8#10   9436    -   t
7065_8#10   126477  -   c
7065_8#10   413711  +   T

File 2 sample data:

>SAEMRSA15_00020 dnaN_DNA_polymerase_III,_beta_chain 2156  3289 forward
ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT
>SAEMRSA15_00060 gyrA_DNA_gyrase_subunit_A 7005  9674 forward
ATGTCGGAAAAAGAAATTTGGGA

Code:

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

open FILE1, "/Users/edwardtickle/Documents/CC22indels.tab";

open FILE2, "/Users/edwardtickle/Documents/CC22_CDS_rmmge.aln";

open( OUTPUTFILE, ">$outputfile" );
my @file1list = ();

while (<FILE1>) {
    if (/^\S+\s+(\d+)/) {
        push @file1list, $1;
    }
}

close FILE1;

while (<FILE2>) {
    if (/^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;

        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4\n";
            }
        }
    }
}

close FILE2;

eg output:

9436 7005 9674 forward

Alongside this captured information, i would like to print the next line of the FASTA file after a match, which includes the sequence data for that gene. I would like the next line to be printed on the same line after the original data. This sounds incredibly simple on paper but i can't work out how to do it! I have attempted to use a previous answer and incorporate it into my code to no avail (shown below). If possible could you please adapt my code instead of suggesting a completely new method for the original correct code, i am trying to make sure i understand every bit of script i use as i go along rather than simply pasting in an answer.

Desired output:

9436 7005 9674 forward ATGTCGGAAAAAGAAATTTGGGA

Incorrect code:

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

open FILE1, "/Users/edwardtickle/Documents/CC22indels.tab";

open FILE2, "/Users/edwardtickle/Documents/CC22_CDS_rmmge.aln";

open( OUTPUTFILE, ">$outputfile" );
my @file1list = ();

while (<FILE1>) {
    if (/^\S+\s+(\d+)/) {
        push @file1list, $1;
    }
}

my $nextline = 0;
close FILE1;

while ( my $line = <FILE2> ) {
    if (/^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;

        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                if ($nextline) {
                    print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4 $nextline\n";
                    $nextline = ( $line =~ /^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/ );
                }
            }
        }
    }
}

close FILE2;

Thanks in advance!

Upvotes: 0

Views: 108

Answers (2)

i alarmed alien
i alarmed alien

Reputation: 9520

The problem is that you never give $nextline a value:

    for my $cc22 (@file1list) {
        if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
            if ($nextline) {
               ...
            }
        }
    }

There's nowhere in that loop for $nextline to get set, so the if ($nextline) statement is never executed. To change that, you need to change your code so $nextline does get initialised. Since you haven't posted any input data, it's a little difficult to know exactly what should be done, but assuming you get a match on one line and then want to print some details and the sequence from the line after the match, you can edit your code to the following:

my $nextline;
while ( my $line = <FILE2> ) {
    if ($line =~ /^>(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;

        # pull in the next line
        $nextline = <FILE2>;

        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                # print out the first part of the line without a line break
                # and the next line, which already has the line break on it.
                print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4 $nextline";
            }
        }
    }
}

using the following input for file 1:

7065_8#10   992   -   t
7065_8#10   2264  -   c
7065_8#10   413711  +   T

Output:

992 517 1878 forward ATGTCGGAAAAAGAAATTTGGGA
2264 2156 3289 forward ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT

Upvotes: 2

choroba
choroba

Reputation: 241848

You start with $nextline = 0. Then, you only change $nextline in the block that starts with if ($nextline). The block is never entered, as $nextline is 0 (false). So, $nextline remains 0 till the end of the script.

Upvotes: 0

Related Questions