jacquelynesun

Reputation: 33

Merging columns in Perl, but the merged columns go onto a new line

I am pretty new to Perl, and I am merging some datasets with the following code. The data is set up like this: the first row gives the sample names, the first column gives the gene names, and the counts fill the second, third, and later columns. I have two big datasets that I'm merging together, and I have been running the Perl script below from Terminal, specifying the path to the script:

$ cd /path/to/file
$ perl /path/to/file dataset1.txt dataset2.txt merged.txt

The Perl script is as follows:

use strict;
my $file1=$ARGV[0];              
my $file2=$ARGV[1];             
my $out=$ARGV[2];               
my %hash=();                     

open(RF,"$file1") or die $!;    
while(my $line=<RF>){
  chomp($line);
  my @arr=split(/\t/,$line);
  my $gene=shift(@arr);
  $hash{$gene}=join("\t",@arr);
}
close(RF);

open(RF,"$file2") or die $!;     
open(WF,">$out") or die $!;    
while(my $line=<RF>){
  chomp($line);
  my @arr=split(/\t/,$line);
  my $gene=shift(@arr);
  if(exists $hash{$gene}){
    print WF $gene . "\t" . $hash{$gene} . "\t" . join("\t",@arr) . "\n";
  }
}
close(WF);
close(RF);

With the above code I am supposed to get a merged table, with the duplicate rows removed and the second text file's columns (Sample A to Sample Z) appended to the first text file's columns (Sample 1 to Sample 100), so it should look like this, separated by tabs:

Gene Name Sample 1 Sample 2 ..... Sample A Sample B...
TP53      2.345    2.234          4.32     4.53

The problem is that my merged file does come back with the two datasets merged, but the second dataset lands on the next row instead of the same row. The script recognises, sorts, and merges the counts correctly, just onto the next row. Is there something wrong with my code or my input?

Thank you for all of your help!!

Upvotes: 2

Views: 208

Answers (1)

TLP

Reputation: 67930

The double line issue might be because of foreign line endings in your input file. You can check this with a command such as:

$ perl -MData::Dumper -ne'$Data::Dumper::Useqq=1; print Dumper $_' file1.txt
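If the dump shows strings ending in "\r\n" instead of "\n", your input has Windows (CRLF) line endings. chomp only removes the "\n", so a stray "\r" stays embedded in the merged output line, and many programs display that carriage return as a line break. A minimal fix, as a sketch, is to replace chomp($line) in both read loops with a substitution that handles either ending:

$line =~ s/\r?\n\z//;    # strip Unix (\n) or Windows (\r\n) line endings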

There are more issues with your code, as follows.

What you seem to be doing is joining lines based on the name in column 1. Be aware that this match is case-sensitive, so it will differentiate between, for example, tp53 and TP53, or Gene name and Gene Name, or something as subtle as "TP53" and "TP53 " (with a trailing space). That can be both good and bad, but be prepared for edge cases.
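If you want the match to tolerate such differences, you can normalize the key before storing and looking it up. A sketch, assuming you want case-insensitive matching with surrounding whitespace ignored (whether that is safe depends on your data):

my $key = lc $gene;        # fold case so TP53 and tp53 match
$key =~ s/^\s+|\s+\z//g;   # trim leading and trailing whitespace

You would then use $key instead of $gene when reading and writing the hash.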

You are expecting 3 arguments to your program, two input files and one output file, but this is a quite un-Perlish way to go about it. I would use the diamond operator for the input files and redirect the output with the shell, like this:

$ perl foo.pl file1 file2 > merged.txt

This will give you the flexibility of adding more files to merge, for example, and gives you the option to test the merge without committing to a file.
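With that invocation the read loop becomes simply (a minimal sketch; your processing replaces the print):

while (my $line = <>) {   # reads file1, then file2, line by line
    print $line;
}

Perl opens each file named in @ARGV for you and moves on to the next one when the current file is exhausted.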

You are using the two-argument form of open without specifying an open mode (e.g. "<"). That is dangerous and leaves you open to code injection. For example, someone could pass "| rm -rf /" as the first argument to your program and delete your whole hard drive (or as much of it as their permissions allow). To prevent this, use the three-argument open with a hard-coded open mode.

Open calls in Perl should also use a lexical file handle, e.g. my $fh, rather than a global bareword handle. It should look like this:

open my $fh,     "<", $input1 or die $!;
open my $fh_out, ">", $output or die $!;

But since we are using the diamond operator, Perl handles that for us automagically.

You also do not need to separate the reading of the files into two loops, since you are basically doing the same thing. There is also no need to first split the lines, and then join them back together.

I wrote this as a sample of how it can be done (it reads from the built-in DATA handle so the example is self-contained; with real input files you would read from <> instead):

use strict;
use warnings;

my %data;
while (<DATA>) {
    chomp;
    my ($name, $line) = /^([^\t]+)(.+)/;    # using a regex match avoiding split
    $data{$name} .= $line;                  # merge lines using concatenation
}

for my $name (sort keys %data) {
    print $name . $data{$name} . "\n";
}

__DATA__
Gene Name   Sample 1    Sample 2    Sample 3    Sample 4
TP53    2.345   2.234   4.32    4.53
TP54    2.345   2.234   4.32    4.53
TP55    2.345   2.234   4.32    4.53
Gene Name   Sample A    Sample B    Sample C    Sample D
TP53    2.345   2.234   4.32    2.53
TP54    2.212   1.234   3.32    6.53
TP55    1.345   2.114   7.32    5.53

On my system it gives the output:

Gene Name      Sample 1        Sample 2        Sample 3        Sample 4        Sample A        Sample B        Sample C        Sample D
TP53    2.345   2.234   4.32    4.53    2.345   2.234   4.32    2.53
TP54    2.345   2.234   4.32    4.53    2.212   1.234   3.32    6.53
TP55    2.345   2.234   4.32    4.53    1.345   2.114   7.32    5.53

This will output the lines in alphabetical order. If you want to preserve the order of the files, you can collect the names in an array while reading the file, and use that when printing. Arrays preserve the order, hash keys do not.
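A sketch of that order-preserving variant, using the same DATA section as above:

my %data;
my @order;
while (<DATA>) {
    chomp;
    my ($name, $line) = /^([^\t]+)(.+)/;
    push @order, $name unless exists $data{$name};   # remember first-seen order
    $data{$name} .= $line;
}
print $_ . $data{$_} . "\n" for @order;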

Upvotes: 3
