Alex
Alex

Reputation: 7045

Merge files based on a mapping in another file

I have written a script in Perl that merges files based on a mapping in a third file; the reason I am not using join is because lines won't always match. The code works, but gives an error that doesn't appear to affect output: Use of uninitialized value in join or string at join.pl line 43, <$fh> line 21. As I am relatively new to Perl I have been unable to understand what is causing this error. Any help resolving this error or advice about my code would be greatly appreciated. I have provided example input and output below.

join.pl

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Tie::File;
use Scalar::Util qw(looks_like_number);

chomp( my $infile  = $ARGV[0] );
chomp( my $infile1 = $ARGV[1] );
chomp( my $infile2 = $ARGV[2] );
chomp( my $outfile = $ARGV[3] );

open my $mapfile,   '<', $infile  or die "Could not open $infile: $!";
open my $file1,   '<', $infile1  or die "Could not open $infile1: $!";
open my $file2,   '<', $infile2  or die "Could not open $infile2: $!";
tie my @tieFile1, 'Tie::File', $infile1 or die "Could not open $infile1: $!";
tie my @tieFile2, 'Tie::File', $infile2 or die "Could not open $infile2: $!";
open my $output, '>', $outfile or die "Could not open $outfile: $!";

my %map1;
my %map2;
# This loop will read two input files and populate two hashes
# using the coordinates (field 2) and the current line number
while ( my $line1 = <$file1>, my $line2 = <$file2> ) {
    my @row1 = split( "\t", $line1 );
    my @row2 = split( "\t", $line2 );
    # $. holds the line number
    $map1{$row1[1]} = $.;
    $map2{$row2[1]} = $.;
}
close($file1);
close($file2);

while ( my $line = <$mapfile> ) {
    chomp $line;
    my @row = split( "\t", $line );
    my $species1 = $row[1];
    my $reference1 = $map1{$species1};
    my $species2 = $row[3];
    my $reference2 = $map2{$species2};
    my @nomatch  = ("NA", "", "NA", "", "", "", "", "NA", "NA");
    # test numeric
    if ( looks_like_number($reference1) && looks_like_number($reference2) ) {
        # do the do using the maps
        print $output join("\t", $tieFile1[$reference1], $tieFile2[$reference2]), "\n";
    }
    elsif ( looks_like_number($reference1) )
    {
        print $output join("\t", $tieFile1[$reference1], @nomatch), "\n";
    }
    elsif ( looks_like_number($reference2) )
    {
        print $output join("\t", @nomatch, $tieFile2[$reference2]), "\n";
    }
}
close($output);
untie @tieFile1;
untie @tieFile2;

input_1:

Scf_3L  12798910    T   0   41  0   0   NA  NA
Scf_3L  12798911    C   0   0   43  0   NA  NA
Scf_3L  12798912    A   42  0   0   0   NA  NA
Scf_3L  12798913    G   0   0   0   44  NA  NA
Scf_3L  12798914    T   0   42  0   0   NA  NA
Scf_3L  12798915    G   0   0   0   44  NA  NA
Scf_3L  12798916    T   0   42  0   0   NA  NA
Scf_3L  12798917    A   41  0   0   0   NA  NA
Scf_3L  12798918    G   0   0   0   43  NA  NA
Scf_3L  12798919    T   0   43  0   0   NA  NA
Scf_3L  12798920    T   0   41  0   0   NA  NA

input_2:

3L  12559896    T   0   31  0   0   NA  NA
3L  12559897    C   0   0   33  0   NA  NA
3L  12559898    A   34  0   0   0   NA  NA
3L  12559899    G   0   0   0   33  NA  NA
3L  12559900    T   0   34  0   0   NA  NA
3L  12559901    G   0   0   0   33  NA  NA
3L  12559902    T   0   33  0   0   NA  NA
3L  12559903    A   33  0   0   0   NA  NA
3L  12559904    G   0   0   0   33  NA  NA
3L  12559905    T   0   34  0   0   NA  NA
3L  12559906    T   0   33  0   0   NA  NA

map:

3L  12798910    T   12559896    T
3L  12798911    C   12559897    C
3L  12798912    A   12559898    A
3L  12798913    G   12559899    G
3L  12798914    T   12559900    T
3L  12798915    G   12559901    G
3L  12798916    T   12559902    T
3L  12798917    A   12559903    A
3L  12798918    G   12559904    G
3L  12798919    T   12559905    T
3L  12798920    T   12559906    T

output:

Scf_3L  12798910    T   0   41  0   0   NA  NA    3L    12559896    T   0   31  0   0   NA  NA
Scf_3L  12798911    C   0   0   43  0   NA  NA    3L    12559897    C   0   0   33  0   NA  NA
Scf_3L  12798912    A   42  0   0   0   NA  NA    3L    12559898    A   34  0   0   0   NA  NA
Scf_3L  12798913    G   0   0   0   44  NA  NA    3L    12559899    G   0   0   0   33  NA  NA
Scf_3L  12798914    T   0   42  0   0   NA  NA    3L    12559900    T   0   34  0   0   NA  NA
Scf_3L  12798915    G   0   0   0   44  NA  NA    3L    12559901    G   0   0   0   33  NA  NA
Scf_3L  12798916    T   0   42  0   0   NA  NA    3L    12559902    T   0   33  0   0   NA  NA
Scf_3L  12798917    A   41  0   0   0   NA  NA    3L    12559903    A   33  0   0   0   NA  NA
Scf_3L  12798918    G   0   0   0   43  NA  NA    3L    12559904    G   0   0   0   33  NA  NA
Scf_3L  12798919    T   0   43  0   0   NA  NA    3L    12559905    T   0   34  0   0   NA  NA
Scf_3L  12798920    T   0   41  0   0   NA  NA    3L    12559906    T   0   33  0   0   NA  NA

Upvotes: 0

Views: 67

Answers (1)

Borodin
Borodin

Reputation: 126722

The immediate problem is that the indices of the tied arrays start at zero, while the line numbers in $. start at 1. That means you need to subtract one from $. or from the $reference variables before using them as indices. So your resulting data was never correct in the first place, and you may have overlooked that if it weren't for the warning!

I fixed that and also tidied up your code a little. I mostly added use autodie so that there's no need to check the status of IO operations (except for Tie::File), changed to list assignments, moved the code to read the files into a subroutine, and added code blocks so that the lexical file handles would be closed automatically

I also used the tied arrays to build the %map hashes instead of opening the files separately, which means their values are already zero-based as they need to be

Oh, and I removed looks_like_number, because the $reference variables must be either numeric or undef because that's all we put into the hash. The correct way to check that a value isn't undef is with the defined operator

#!/usr/bin/perl

use strict;
use warnings 'all';
use autodie;

use Fcntl 'O_RDONLY';
use Tie::File;

my ( $mapfile, $infile1, $infile2, $outfile ) = @ARGV;

{
    tie my @file1, 'Tie::File' => $infile1, mode => O_RDONLY
        or die "Could not open $infile1: $!";

    tie my @file2, 'Tie::File' =>$infile2, mode => O_RDONLY
            or die "Could not open $infile2: $!";

    my %map1 = map { (split /\t/, $file1[$_], 3)[1] => $_ } 0 .. $#file1;
    my %map2 = map { (split /\t/, $file2[$_], 3)[1] => $_ } 0 .. $#file2;

    open my $map_fh, '<', $mapfile;

    open my $out_fh, '>', $outfile;

    while ( <$map_fh> ) {
        chomp;
        my @row = split /\t/;

        my ( $species1, $species2 ) = @row[1,3];
        my $reference1 = $map1{$species1};
        my $reference2 = $map2{$species2};

        my @nomatch    = ( "NA", "", "NA", "", "", "", "", "NA", "NA" );

        my @fields = (
            ( defined $reference1 ? $file1[$reference1] : @nomatch),
            ( defined $reference2 ? $file2[$reference2] : @nomatch),
        );

        print $out_fh join( "\t", @fields ), "\n";
    }
}

output

Scf_3L  12798910    T   0   41  0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798911    C   0   0   43  0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798912    A   42  0   0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798913    G   0   0   0   44  NA  NA  NA      NA                  NA  NA
Scf_3L  12798914    T   0   42  0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798915    G   0   0   0   44  NA  NA  NA      NA                  NA  NA
Scf_3L  12798916    T   0   42  0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798917    A   41  0   0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798918    G   0   0   0   43  NA  NA  NA      NA                  NA  NA
Scf_3L  12798919    T   0   43  0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798920    T   0   41  0   0   NA  NA  NA      NA                  NA  NA

Upvotes: 2

Related Questions