I am pretty new to Perl, and I am merging some datasets with the following code. The data is set up like this: the first row specifies the sample names, the first column specifies the gene names, and the counts fill the second, third, ... columns. I've got 2 big datasets that I'm merging together, and I have been using the following Perl script, by specifying the path to the script and running this in Terminal:
$ cd /path/to/file
$ perl /path/to/file dataset1.txt dataset2.txt merged.txt
The Perl script is as follows:
use strict;

my $file1 = $ARGV[0];
my $file2 = $ARGV[1];
my $out   = $ARGV[2];
my %hash  = ();

open(RF, "$file1") or die $!;
while (my $line = <RF>) {
    chomp($line);
    my @arr  = split(/\t/, $line);
    my $gene = shift(@arr);
    $hash{$gene} = join("\t", @arr);
}
close(RF);

open(RF, "$file2") or die $!;
open(WF, ">$out") or die $!;
while (my $line = <RF>) {
    chomp($line);
    my @arr  = split(/\t/, $line);
    my $gene = shift(@arr);
    if (exists $hash{$gene}) {
        print WF $gene . "\t" . $hash{$gene} . "\t" . join("\t", @arr) . "\n";
    }
}
close(WF);
close(RF);
With the above code I am supposed to get a merged table, with the duplicate rows deleted, and the second text file's columns (Sample A to Sample Z) merged onto the first text file's columns (Sample 1 to Sample 100), so it should look like this, separated by tabs:
Gene Name Sample 1 Sample 2 ..... Sample A Sample B...
TP53 2.345 2.234 4.32 4.53
The problem is that my merged files do come back with the two datasets merged, but the second dataset ends up on the next row instead of the same row. It will recognise, sort, and merge the counts, but onto the next row. Is there something wrong with my code or my input?
Thank you for all of your help!!
The double-line issue might be caused by foreign line endings in one of your input files. You can check for this with a command such as:
$ perl -MData::Dumper -ne'$Data::Dumper::Useqq=1; print Dumper $_' file1.txt
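If that dump shows lines ending in "\r\n" rather than just "\n", your files have Windows (CRLF) line endings, and chomp alone will leave a stray "\r" glued to the last field of every line; a "\r" in the middle of a merged line is exactly the kind of thing that can make the second dataset appear to start on a new row. A minimal sketch of the fix, assuming tab-separated input (it only echoes normalized lines, it is not your full merge):
use strict;
use warnings;

while (my $line = <>) {
    $line =~ s/\r?\n\z//;              # strips "\n" and an optional preceding "\r"
    my @fields = split /\t/, $line;    # now the last field has no hidden "\r"
    print join("\t", @fields), "\n";
}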
There are more issues with your code, as follows.
What you seem to be doing is joining lines based on the name in column 1. You should be aware that this match is done case-sensitively, so it will differentiate between for example "tp53" and "TP53", or "Gene name" and "Gene Name", or something as subtle as "TP53" and "TP53 " (an extra trailing space). That can be both good and bad, but be prepared for edge cases.
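If you decide the match should be more forgiving, here is a small sketch of one way to normalize the keys before storing and looking them up. This normalization is my assumption, not something your script currently does, and normalize_gene is a hypothetical helper; adjust it to your data:
use strict;
use warnings;

# Hypothetical helper: fold case and trim whitespace so "tp53", " TP53" and
# "TP53 " all map to the same hash key.
sub normalize_gene {
    my ($gene) = @_;
    $gene =~ s/^\s+|\s+$//g;    # trim leading/trailing whitespace
    return uc $gene;            # fold to upper case
}

print normalize_gene(" tp53 "), "\n";    # prints "TP53"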
You are expecting 3 arguments to your program (two input files and the output), but this is a rather un-Perlish way to go about it. I would use the diamond operator for the input files, and then redirect the output with the shell, such as:
$ perl foo.pl file1 file2 > merged.txt
This gives you the flexibility to add more files to merge, for example, and the option to test the merge on screen without committing to an output file.
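In the script itself that just means reading from the diamond operator; Perl opens each file named on the command line for you. A tiny sketch (it only echoes what it reads, to show where the lines come from):
use strict;
use warnings;

# <> reads every file given on the command line in turn, so
# "perl foo.pl file1 file2" feeds both files through this one loop.
while (my $line = <>) {
    chomp $line;
    print "$ARGV: $line\n";    # $ARGV holds the name of the file currently being read
}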
You are using the 2-argument open, without specifying an open mode (e.g. "<"). That is very dangerous and leaves you open to code injection. For example, someone could pass "| rm -rf /" as the first argument to your program and delete your whole hard drive (or as much of it as their permissions allow). To prevent this, use the 3-argument open and specify a hard-coded open mode.
Open commands in Perl should also use a lexical file handle, e.g. my $fh, and not a global bareword. It should look like this:
open my $fh, "<", $input1 or die $!;
open my $fh_out, ">", $output or die $!;
But since we are using the diamond operator, Perl handles that for us automagically.
You also do not need to separate the reading of the files into two loops, since you are basically doing the same thing with both. There is also no need to first split the lines and then join them back together.
I wrote this as a sample of how it can be done:
use strict;
use warnings;

my %data;

while (<DATA>) {
    chomp;
    my ($name, $line) = /^([^\t]+)(.+)/;    # a regex match avoids split
    $data{$name} .= $line;                  # merge lines by concatenation
}

for my $name (sort keys %data) {
    print $name . $data{$name} . "\n";
}
__DATA__
Gene Name Sample 1 Sample 2 Sample 3 Sample 4
TP53 2.345 2.234 4.32 4.53
TP54 2.345 2.234 4.32 4.53
TP55 2.345 2.234 4.32 4.53
Gene Name Sample A Sample B Sample C Sample D
TP53 2.345 2.234 4.32 2.53
TP54 2.212 1.234 3.32 6.53
TP55 1.345 2.114 7.32 5.53
On my system it gives the output:
Gene Name Sample 1 Sample 2 Sample 3 Sample 4 Sample A Sample B Sample C Sample D
TP53 2.345 2.234 4.32 4.53 2.345 2.234 4.32 2.53
TP54 2.345 2.234 4.32 4.53 2.212 1.234 3.32 6.53
TP55 2.345 2.234 4.32 4.53 1.345 2.114 7.32 5.53
This will output the lines in alphabetical order. If you want to preserve the order of the files, you can collect the names in an array while reading the file, and use that when printing. Arrays preserve the order, hash keys do not.
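A sketch of that variant, using the same merging logic as above but reading from the files named on the command line and remembering the order in which gene names first appear (the @order array is the only addition):
use strict;
use warnings;

my %data;
my @order;    # gene names in the order they are first seen

while (my $line = <>) {
    chomp $line;
    my ($name, $rest) = $line =~ /^([^\t]+)(.+)/;
    next unless defined $name;                         # skip lines that do not look like gene rows
    push @order, $name unless exists $data{$name};     # record first appearance
    $data{$name} .= $rest;
}

for my $name (@order) {
    print $name . $data{$name} . "\n";
}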