Vikas
Vikas

Reputation: 327

Manipulating files according to indexes by perl

I am working on some genome data and I have 2 files ->

File1

A1 1   10
A1  15  20
A2  2   11
A2  13  16

File2

>A1
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
>A2
GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT

In file 1, 2nd and 3rd column represents the indexes in File2. So I want that, if character in column1 of file1 matches with character followed by symbol (>) in file2 , then from next line of that file2 give back the substring according to indexes in col2 and col3 of file1. (sorry, I know its complicated) Here is the desire output ->

Output

>A1#1:10
CTATTATTTA
>A1#15:20
ACCTA
>A2#2:11
TCTGCACAGC
>A2#13:16
GCTT

I know if I have only 1 string I can take out sub-string very easily ->

@ARGV or die "No input file specified";
open $first, '<',$ARGV[0] or die "Unable to open input file: $!";
$string="GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT";
while (<$first>) 
{
@cols = split /\s+/;
$co=$cols[1]-1;
$length=$cols[2]-$co;
$fragment =  substr $string, $co, $length;
print ">",$cols[0],"#",$cols[1],":",$cols[2],"\n",$fragment,"\n";
}

but here my problem is when should I input my second file and how should I match the character in col1 (of file1) with character in file2 (followed by > symbol) and then how to get substring?

Upvotes: 0

Views: 370

Answers (3)

Rich
Rich

Reputation: 183

This 2nd update turns the master file into a hash of arrays as well.

This treats each row in the 2nd file as individual sequences.

use strict;
my $master_file = "dna_master.txt";
if ($#ARGV) {
    print "Usage: $0 [filename(s)]\n";
    exit 1;
}

my %Data = read_master($master_file);

foreach my $index_file (@ARGV) {
    my %Index = read_index($index_file);
    foreach my $key (sort keys %Index) {
        foreach my $i (@{$Index{$key}}) {
            my ($start,$stop) = @$i;
            print ">$key#$start:$stop\n";
            my $pos = $start - 1;
            my $count = $stop - $start + 1;
            foreach my $seq (@{$Data{$key}}) {
                print substr($seq,$pos,$count)."\n";
            }
        }
    }
}

sub read_file {
    my $file = shift;
    my @lines;
    open(FILE, $file) or die "Error: cannot open $file\n$!";
    while(<FILE>){
        chomp; #remove newline
        s/(^\s+|\s+$)//g; # strip lead/trail whitespace
        next if /^$/;  # skip blanks
        push @lines, $_;
    }
    close FILE;
    return @lines;
}

sub read_index {
    my $file = shift;
    my @lines = read_file($file);
    my %index;
    foreach (@lines) {
        my ($key,$start,$stop) = split /\s+/;
        push @{$index{$key}}, [$start,$stop]; 
    }
    return %index;
}

sub read_master {
    my $file = shift;
    my %master;
    my $key;
    my @lines = read_file($file);
    foreach (@lines) {
        if ( m{^>(\w+)} ) { $key = $1 }
        else { push @{ $master{$key} }, $_ }
    }
    return %master;
}

Output:

>A1#1:10
CTATTATTTA
AAGTGTGTTA
>A1#15:20
ACCTAC
ATTAAT
>A2#2:11
TCTGCACAGC
ACCCCCCCCT
AAACCCCAAA
>A2#13:16
GCTT
CCCC
ACAA

Upvotes: 1

Rich
Rich

Reputation: 183

I wasnt sure if they were all one continuous line or separate lines. I set it up as continuous for now.

Basically, read the 2nd file as master. Then you can process as many index files as you need.

You can use hash of arrays to help with the indexing. push @{$index{$key}}, [$start,$stop];

use strict;
my $master_file = "dna_master.txt";
if ($#ARGV) {
    print "Usage: $0 [filename(s)]\n";
    exit 1;
}

my %Data = read_master($master_file);

foreach my $index_file (@ARGV) {
    my %Index = read_index($index_file);
    foreach my $key (sort keys %Index) {
        foreach my $i (@{$Index{$key}}) {
            my ($start,$stop) = @$i;
            print ">$key#$start:$stop\n";
            my $pos = $start - 1;
            my $count = $stop - $start + 1;
            print substr($Data{$key},$pos,$count)."\n";
        }
    }
}

sub read_file {
    my $file = shift;
    my @lines;
    open(FILE, $file) or die "Error: cannot open $file\n$!";
    while(<FILE>){
        chomp; #remove newline
        s/(^\s+|\s+$)//g; # strip lead/trail whitespace
        next if /^$/;  # skip blanks
        push @lines, $_;
    }
    close FILE;
    return @lines;
}

sub read_index {
    my $file = shift;
    my @lines = read_file($file);
    my %index;
    foreach (@lines) {
        my ($key,$start,$stop) = split /\s+/;
        push @{$index{$key}}, [$start,$stop]; 
    }
    return %index;
}

sub read_master {
    my $file = shift;
    my %master;
    my $key;
    my @lines = read_file($file);
    foreach (@lines) {
        if ( m{^>(\w+)} ) { $key = $1 }
        else { $master{$key} .= $_ }
    }
    return %master;
}

Upvotes: 1

m0skit0
m0skit0

Reputation: 25873

Load File2 in a Hash, with A1, A2... as keys, and the DNA sequence as value. This way you can get the DNA sequence easily.

Upvotes: 1

Related Questions