Kenneth Condon
Kenneth Condon

Reputation: 11

Create a table by merging many files

This seemed like such an easy task, yet I am boggled.

I have text files, each named after a type of tissue (e.g. cortex.txt, heart.txt)

Each file contains two columns, and the column headers are gene_name and expression_value

Each file contains around 30K to 40K rows

I need to merge the files into one file with 29 columns, with headers

genename, tissue1, tissue2, tissue3, etc. to tissue28

So that each row contains one gene and its expression value in the 28 tissues

The following code creates an array containing a list of every gene name in every file:

my @list_of_genes;

foreach my $input_file ( @input_files ) {

    print $input_file, "\n";

    open ( IN, "outfiles/$input_file");

    while ( <IN> ) {

        if ( $_ =~ m/^(\w+\|ENSMUSG\w+)\t/) {

            # check if the gene is already in the gene list
            my $count = grep { $_ eq $1 } @list_of_genes;

            # if not in list, add to the list
            if ( $count == 0 ) {
                push (@list_of_genes, $1);
            }
        }
    }

    close IN;
}

The next bit of code I was hoping would work, but the regex only recognises the first gene name.

Note: I am only testing it on one test file called "tissue1.txt".

The idea is to create an array of all the file names, and then take each gene name in turn and search through each file to extract each value and write it to the outfile in order along the row.

foreach my $gene (@list_of_genes) {

    # print the gene name in the first column
    print OUT $gene, "\t";

    # use the gene name to search the first element of the @input_file array and dprint to the second column
    open (IN, "outfiles/tissue1.txt");

       while ( <IN> ) { 

        if ($_ =~ m/^$gene\t(.+)\n/i ) {
            print OUT $1;
        }

    }

    print OUT "\n";
}

EDIT 1: Thank you Borodin. The output of your code is indeed a list of every gene name with a all expression values in each tissue.

e.g. Bcl20|ENSMUSG00000000317,0.815796340254127,0.815796340245643

This is great much better than I managed thank you. Two additional things are needed.

1) If a gene name is not found in the a .txt file then a value of 0 should be recorded

e.g. Ht4|ENSMUSG00000000031,4.75878049632381, 0

2) I need a comma separated header row so that the tissue from which each value comes remains associated with the value (basically a table) - the tissue is the name of the text file

e.g. From 2 files heart.txt and liver.txt the first row should be:

genename|id,heart,liver

where genename|id is always the first header

Upvotes: 1

Views: 53

Answers (1)

Borodin
Borodin

Reputation: 126722

That's a lot of code to implement the simple idiom of using a hash to enforce uniqueness!

It's looking like you want an array of expression values for each different ENSMUSG string in all *.txt files in your outfiles directory.

If the files you need are the only ones in the outfles directory, then the solution looks like this. I've used autodie to check the return status of all Perl IO operations (chdir, open, print etc.) and checked only that the $gene value contains |ENSMUSG. You may not need even this check if your input data is well-behaved.

Please forgive me if this is bugged, as I have no access to a Perl compiler at present. I have checked it by sight and it looks fine.

use strict;
use warnings 'all';
use autodie;

chdir '/path/to/outfiles';

my %data;

while ( my $file = glob '*.txt' ) {

    open my $fh, '<', $file;

    while ( <$fh> ) {
        my ($gene, $value) = split;
        next unless $gene =~ /\|ENSMUSG/;
        push @{ $data{$gene} }, $value;
    }
}

print join(',', $_, @{ $data{$_} }), "\n" for keys %data;

Upvotes: 1

Related Questions