Reputation: 11
This seemed like such an easy task, yet I am boggled.
I have text files, each named after a type of tissue (e.g. cortex.txt
, heart.txt
)
Each file contains two columns, and the column headers are gene_name
and expression_value
Each file contains around 30K to 40K rows
I need to merge the files into one file with 29 columns, with headers
genename, tissue1, tissue2, tissue3, etc. to tissue28
So that each row contains one gene and its expression value in the 28 tissues
The following code creates an array containing a list of every gene name in every file:
my @list_of_genes;
foreach my $input_file ( @input_files ) {
print $input_file, "\n";
open ( IN, "outfiles/$input_file");
while ( <IN> ) {
if ( $_ =~ m/^(\w+\|ENSMUSG\w+)\t/) {
# check if the gene is already in the gene list
my $count = grep { $_ eq $1 } @list_of_genes;
# if not in list, add to the list
if ( $count == 0 ) {
push (@list_of_genes, $1);
}
}
}
close IN;
}
The next bit of code I was hoping would work, but the regex only recognises the first gene name.
Note: I am only testing it on one test file called "tissue1.txt".
The idea is to create an array of all the file names, and then take each gene name in turn and search through each file to extract each value and write it to the outfile in order along the row.
foreach my $gene (@list_of_genes) {
# print the gene name in the first column
print OUT $gene, "\t";
# use the gene name to search the first element of the @input_file array and dprint to the second column
open (IN, "outfiles/tissue1.txt");
while ( <IN> ) {
if ($_ =~ m/^$gene\t(.+)\n/i ) {
print OUT $1;
}
}
print OUT "\n";
}
EDIT 1: Thank you Borodin. The output of your code is indeed a list of every gene name with a all expression values in each tissue.
e.g. Bcl20|ENSMUSG00000000317,0.815796340254127,0.815796340245643
This is great much better than I managed thank you. Two additional things are needed.
1) If a gene name is not found in the a .txt file then a value of 0 should be recorded
e.g. Ht4|ENSMUSG00000000031,4.75878049632381, 0
2) I need a comma separated header row so that the tissue from which each value comes remains associated with the value (basically a table) - the tissue is the name of the text file
e.g. From 2 files heart.txt and liver.txt the first row should be:
genename|id,heart,liver
where genename|id is always the first header
Upvotes: 1
Views: 53
Reputation: 126722
That's a lot of code to implement the simple idiom of using a hash to enforce uniqueness!
It's looking like you want an array of expression values for each different ENSMUSG
string in all *.txt
files in your outfiles
directory.
If the files you need are the only ones in the outfles
directory, then the solution looks like this. I've used autodie
to check the return status of all Perl IO operations (chdir
, open
, print
etc.) and checked only that the $gene
value contains |ENSMUSG
. You may not need even this check if your input data is well-behaved.
Please forgive me if this is bugged, as I have no access to a Perl compiler at present. I have checked it by sight and it looks fine.
use strict;
use warnings 'all';
use autodie;
chdir '/path/to/outfiles';
my %data;
while ( my $file = glob '*.txt' ) {
open my $fh, '<', $file;
while ( <$fh> ) {
my ($gene, $value) = split;
next unless $gene =~ /\|ENSMUSG/;
push @{ $data{$gene} }, $value;
}
}
print join(',', $_, @{ $data{$_} }), "\n" for keys %data;
Upvotes: 1