Print FASTA sequence after successful header match onto the same line in output file

Question

After a previous question i have some code which nearly does what i intend to do, but not quite.

I am trying to compare every position from FILE1 (3750/126477 etc) with every range in FILE2 (517-1878,2156-3289 etc). If it falls in a range, i want to print the position, range, direction and then the FASTA sequence from the next line onto the same line in the output file. Currently if multiple positions lie in the same range then it will group all the positions into a block before including the sequence only on the last line, when i want every match to include the relevant gene sequence.

My FILE1 example data:

7065_8#10   3750    -   t
7065_8#10   126477  -   c
7065_8#10   1200    +   T
7065_8#10   3800    -   t

My FILE2 example data:

>SAEMRSA15_00010 dnaA_chromosomal_replication_initiator_protein_DnaA 517  1878 forward
ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGAAATTGCTCAAGAAAAATTATCAGCTGTAAGTTACTCAACTTTCCTAAA
>SAEMRSA15_00020 dnaN_DNA_polymerase_III,_beta_chain 2156  3289 forward
ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACA
>SAEMRSA15_00030 conserved_hypothetical_protein 3670  3915 forward
GTGATTATTTTGGTTCAAGAAGTTGTAGTAGAAGGAGACATTAATTTAGGTCAATTTCTAAAAACAGAAGGGATTATTGAATCTGGTGGTCAAG

My code:

#!/usr/bin/perl 

use strict;
use warnings;
use autodie;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt"; 

open FILE1, "/Users/edwardtickle/Documents/CC22indels.tab";

open FILE2, "/Users/edwardtickle/Documents/CC22_CDS_rmmge.aln";

open (OUTPUTFILE, ">$outputfile");
my @file1list=();

while () {
if (/^\S+\s+(\d+)/) {
push @file1list, $1;
}
}

my $nextline = 0;
close FILE1;

while ( my $line =  ) {
if ($nextline) {
    print OUTPUTFILE "$line
";
    $nextline = '';
}
elsif ($line =~ /^>(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/) {
    my $cds1 = $1;
    my $cds2 = $2;
    my $cds3 = $3;
    my $cds4 = $4;

    for my $cc22 (@file1list) {
        if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
            $nextline++;
            print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4
";
        }
    }
    }
  }

close FILE2;

My results:

1200 517 1878 forward
ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGAAATTGCTCAAGAAAAATTATCAGCTGTAAGTTACTCAACTTTCCTAAA

3750 3670 3915 forward
3800 3670 3915 forward
 GTGATTATTTTGGTTCAAGAAGTTGTAGTAGAAGGAGACATTAATTTAGGTCAATTTCTAAAAACAGAAGGGATTATTGAATCTGGTGGTCAAG

My desired results:

1200 517 1878 forward
ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGAAATTGCTCAAGAAAAATTATCAGCTGTAAGTTACT

3750 3670 3915 forward    GTGATTATTTTGGTTCAAGAAGTTGTAGTAGAAGGAGACATTAATTTAGGTCAATTTCTAAAAACAGAAGGGATTATTGA
3800 3670 3915 forward
GTGATTATTTTGGTTCAAGAAGTTGTAGTAGAAGGAGACATTAATTTAGGTCAATTTCTAAAAACAGAAGGGATTATTGAATCTGGTGGTCAAG

I believe it is due to the fact that the first part of the code is never matched until the second if rule, but i don't know how to change the order while keeping it functional.

Alternatively, is there a way of printing the next line after a header match if it includes the letters ATCG (which it obviously always will). This would strike me as a more efficient way but yet again i don't know where to start.

Thanks for your help!

i alarmed alien · Accepted Answer

To get the results you want without changing your existing code too much, you can get the sequence line when you're processing the header line:

while ( my $line =  ) {
    if ($line =~ /^>(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;
        # fetch the next line from the file -- i.e. the sequence
        $nextline = ;

        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                print "$cc22 $cds2 $cds3 $cds4 $nextline";
            }
        }
    }
}

Print FASTA sequence after successful header match onto the same line in output file

Answers (2)

Related Questions