Reputation: 3
(Note: Column headers are there for readability and are not in the actual files)
File1:
COLUMN1 COLUMN2 COLUMN3
AG_446337835.1 example1 grgsdt
AG_448352465.1 example2 190197
AG_449465753.1 example3 h837h8
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4
AG_441227463.1 example6 f3fw4t
AG_449986090.1 example7 gft7r4
AG_445666926.1 example8 4vsr55
AG_441004541.1 example9 fh893b
AG_444837264.1 example0 k3883d
File2:
COLUMN1 COLUMN2
grgsdt AAHG
h837h8 JUJN
190197 POKJ
f45ge4 DFRF
gft7r4 NNHN
d34tw4
fh893b YUNIP
k3883d YUNIP
f3fw4t YUNIP
190197 YUNIP
4vsr55 GHGF
Desired output:
COLUMN1 COLUMN2 COLUMN3 COLUMN4 (formerly column2 from file2)
AG_446337835.1 example1 grgsdt AAHG
AG_448352465.1 example2 190197 POKJ YUNIP
AG_449465753.1 example3 h837h8 JUJN
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4 DFRF
AG_441227463.1 example6 f3fw4t YUNIP
AG_449986090.1 example7 gft7r4 NNHN
AG_445666926.1 example8 4vsr55 GHGF
AG_441004541.1 example9 fh893b YUNIP
AG_444837264.1 example0 k3883d YUNIP
I am barely familiar with Perl (or programming in general), and I was wondering if you would mind advising me on this problem.
Essentially, column 3 in file1 corresponds to column 1 in file2.
I want to take each line in file1, read column 3 of that line, and search file2 for a matching entry. If a matching entry exists, I want to print the line from file1, with an extra column from file2, to a new file (as seen in the example output).
The file sizes are:
File1: 2GB
File2: 718MB
This script will be run on a machine with 250GB of RAM, so memory is not an issue.
This is what I have so far:
#!/usr/bin/perl ;
#use warnings;
use Getopt::Long qw(GetOptions);
use experimental 'smartmatch';
#Variable to store inputted text file data
my $db ;
my $db2 ;
#Open and read File one into memory
open FPIN, "file1.txt" or die "Could not open";
my @file1 = <FPIN> ;
close FPIN;
#Open and read file two into memory
open FPIN, "file2.tab" or die "Could not open";
my @file2 = <FPIN> ;
close FPIN ;
foreach (@file2)
{
if (/(^\w+)\t(.+)/)
{
split /\t/, $2;
$db2->{$1}->{"geneName"} = $1 ;
$db2->{$1}->{"protein"} = $2 ;
}
}
foreach (@file1)
{
#if line begins with any word character tab and anything
if (/(^\w+.\d+)\t(.+)/)
{
my @fields = split /\t/, $2;
my $refSeqID = $1;
#assign the data in the array to variables
my ($geneSymbol, $geneName) = @fields[0, 1];
#Create database data structure and fill it with the info
$db->{$2}->{"refSeqID"} = $refSeqID ;
$db->{$2}->{"geneSymbol"} = $geneSymbol ;
$db->{$2}->{"geneName"} = $geneName ;
}
}
foreach my $id (sort keys %{$db2})
{
if ( exists $db->{$id} )
{
print $db2->{$id}."\t".$db->{$id}->{$geneSymbol}."\t".$db->{$id}->
{$refSeqID}."\t".$db2->{$id}->{$protein}."\n";
}
}
I seem to be able to read both files into memory correctly. However, I have been completely unable to compare the files to each other, and I am stumped on how to approach it.
Actually printing the result will be another issue I need to tackle.
Upvotes: 0
Views: 480
Reputation: 126722
This will do as you ask.
It starts by reading file2.txt and building a hash %f2 that relates the value of the first column to the value of the second. Thereafter it's just a matter of reading through file1.txt, splitting it into fields, and adding a further field obtained by accessing the hash using the value of the third field.
I've used autodie to save the trouble of handling errors in the open calls. Otherwise everything is standard.
I've just noticed that a column 1 value may be repeated in file2.txt, so I've changed the code to make each key of the hash correspond to an array of values. All the values in the array appear, space-separated, in column 4 of the output.
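use strict;
use warnings 'all';
use autodie;
# Map each column 1 value in file2.txt to an array of its column 2 values
my %f2;
{
    open my $fh, '<', 'file2.txt';
    while ( <$fh> ) {
        my ($key, $val) = split;
        next unless defined $key;    # skip blank lines
        $f2{$key} //= [];
        # "defined" rather than a truth test, so a value of "0" is kept
        push @{ $f2{$key} }, $val if defined $val;
    }
}
# Append the looked-up values to each line of file1.txt as a fourth column
open my $fh, '<', 'file1.txt';
while ( <$fh> ) {
    my @line = split;
    my $c4 = $f2{$line[2]};
    push @line, $c4 ? join(' ', @$c4) : '';
    local $" = "\t";    # join the interpolated array with tabs
    print "@line\n";
}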
Output:
AG_446337835.1 example1 grgsdt AAHG
AG_448352465.1 example2 190197 POKJ YUNIP
AG_449465753.1 example3 h837h8 JUJN
AG_449366462.1 example4 d34tw4
AG_444725037.1 example5 f45ge4 DFRF
AG_441227463.1 example6 f3fw4t YUNIP
AG_449986090.1 example7 gft7r4 NNHN
AG_445666926.1 example8 4vsr55 GHGF
AG_441004541.1 example9 fh893b YUNIP
AG_444837264.1 example0 k3883d YUNIP
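The code above splits on any whitespace and prints to STDOUT. If the real files are strictly tab-delimited (the regexes in the question suggest so) and the result should go to a new file, the same idea can be sketched with subs that take filehandles. This is only a sketch under those assumptions: the names build_lookup and join_files are hypothetical, and the demo runs on a few of the sample rows held in memory rather than on the real files.

```perl
use strict;
use warnings;

# build_lookup (hypothetical name): read file2-style lines from $fh and
# return a hash ref mapping column 1 to an array ref of column 2 values.
sub build_lookup {
    my ($fh) = @_;
    my %lookup;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;              # skip blank lines
        my ($key, $val) = split /\t/, $line;   # tabs only, so fields may contain spaces
        $lookup{$key} //= [];
        push @{ $lookup{$key} }, $val if defined $val and length $val;
    }
    return \%lookup;
}

# join_files (hypothetical name): copy each file1-style line from $in to
# $out with a fourth column holding the space-separated lookup values.
# Lines without a match get an empty fourth column.
sub join_files {
    my ($in, $lookup, $out) = @_;
    while ( my $line = <$in> ) {
        chomp $line;
        next unless length $line;
        my @fields = split /\t/, $line;
        my $extra  = $lookup->{ $fields[2] // '' } // [];
        print {$out} join("\t", @fields, join(' ', @$extra)), "\n";
    }
    return;
}

# Demo on in-memory copies of a few sample rows from the question.
my $file2 = "grgsdt\tAAHG\n190197\tPOKJ\nd34tw4\n190197\tYUNIP\n";
my $file1 = "AG_446337835.1\texample1\tgrgsdt\n"
          . "AG_448352465.1\texample2\t190197\n"
          . "AG_449366462.1\texample4\td34tw4\n";

open my $fh2, '<', \$file2 or die "Cannot open file2 data: $!";
my $lookup = build_lookup($fh2);
close $fh2;

open my $fh1, '<', \$file1 or die "Cannot open file1 data: $!";
my $result = '';
open my $out, '>', \$result or die "Cannot open output buffer: $!";
join_files($fh1, $lookup, $out);
close $out;
close $fh1;

print $result;
```

To run it on the real data, replace the scalar references in the three open calls with the file names, for example '<', 'file1.txt' for input and '>', 'joined.txt' (a made-up name) for output.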
Upvotes: 1
Reputation: 23824
This one makes a left join. The key idea is to use geneName as a key in a hash.
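#!/usr/bin/perl
use strict;
use warnings;
# Index file1 by its third column (geneName).
# Assumes each geneName occurs only once in file1.
my %data;
open my $in1, '<', 'file1' or die "Cannot open file1: $!";
while (<$in1>) {
    chomp;
    my @c = split;
    next unless @c >= 3;
    $data{$c[2]} = [@c[0 .. 2]];
}
close $in1;
# Append every matching value from file2. Keys that do not occur in
# file1 are skipped, so the result stays a left join.
open my $in2, '<', 'file2' or die "Cannot open file2: $!";
while (<$in2>) {
    chomp;
    my @c = split;
    next unless @c and exists $data{$c[0]};
    push @{$data{$c[0]}}, $c[1] // "";
}
close $in2;
# Hash order is arbitrary, so the rows are not printed in file1's order.
print map { "@{$_}\n" } values %data;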
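Because the output above comes from values %data, the row order is arbitrary, and file1 rows that share a geneName overwrite each other. One possible way around both, assuming everything fits in memory, is to keep the rows in an array and let the hash map each key to the rows that carry it. The following is a minimal sketch, not a drop-in replacement: the sub name left_join and its array-based inputs are hypothetical, and the 'zzzzzz XXXX' row is made up to show that unmatched file2 keys are ignored.

```perl
use strict;
use warnings;

# left_join (hypothetical name): takes two array refs of whitespace-separated
# lines and returns the joined rows, tab-separated, in the left input's order.
sub left_join {
    my ($left, $right) = @_;
    my (@rows, %by_key);
    for my $line (@$left) {
        my @c = split ' ', $line;
        next unless @c >= 3;                  # skip blank or short lines
        push @rows, \@c;
        # Several rows may share a key, so index a list of rows per key.
        push @{ $by_key{ $c[2] } }, \@c;
    }
    for my $line (@$right) {
        my ($key, $val) = split ' ', $line;
        # Skip keys absent from the left input so this stays a left join.
        next unless defined $key and defined $val and exists $by_key{$key};
        push @$_, $val for @{ $by_key{$key} };
    }
    return map { join "\t", @$_ } @rows;
}

my @joined = left_join(
    [ 'AG_446337835.1 example1 grgsdt', 'AG_448352465.1 example2 190197' ],
    [ 'grgsdt AAHG', '190197 POKJ', 'zzzzzz XXXX', '190197 YUNIP' ],
);
print "$_\n" for @joined;
```

As in the code above, repeated file2 values for one key end up as additional columns rather than being space-joined into a single column 4.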
Upvotes: 0