Create a table by merging many files

Question

This seemed like such an easy task, yet I am boggled.

I have text files, each named after a type of tissue (e.g. cortex.txt, heart.txt)

Each file contains two columns, and the column headers are gene_name and expression_value

Each file contains around 30K to 40K rows

I need to merge the files into one file with 29 columns, with headers

genename, tissue1, tissue2, tissue3, etc. to tissue28

So that each row contains one gene and its expression value in the 28 tissues

The following code creates an array containing a list of every gene name in every file:

my @list_of_genes;

foreach my $input_file ( @input_files ) {

    print $input_file, "\n";

    open ( IN, "outfiles/$input_file");

    while ( <IN> ) {

        if ( $_ =~ m/^(\w+\|ENSMUSG\w+)\t/ ) {

            # check if the gene is already in the gene list
            my $count = grep { $_ eq $1 } @list_of_genes;

            # if not in list, add to the list
            if ( $count == 0 ) {
                push (@list_of_genes, $1);
            }
        }
    }

    close IN;
}

The next bit of code I was hoping would work, but the regex only recognises the first gene name.

Note: I am only testing it on one test file called "tissue1.txt".

The idea is to create an array of all the file names, and then take each gene name in turn and search through each file to extract each value and write it to the outfile in order along the row.

foreach my $gene (@list_of_genes) {

    # print the gene name in the first column
    print OUT $gene, "\t";

    # use the gene name to search the first element of the @input_files array and print to the second column
    open (IN, "outfiles/tissue1.txt");

    while ( <IN> ) {

        if ($_ =~ m/^$gene\t(.+)\n/i ) {
            print OUT $1;
        }

    }

    print OUT "\n";
}
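One thing that may be relevant here: the gene names contain a `|`, which is the alternation operator once it is interpolated into a regex, so `m/^$gene\t(.+)\n/` does not mean "this exact name followed by a tab". A minimal sketch of the difference, using `\Q...\E` to quote the interpolated name. The helper names and the second sample line are made up for illustration, and this is not necessarily the only problem in the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: interpolates the gene name as-is, so "|" acts as alternation.
sub value_unquoted {
    my ($gene, $line) = @_;
    return $line =~ /^$gene\t(.+)\n/i ? 1 : 0;
}

# Hypothetical helper: \Q...\E escapes regex metacharacters in the gene name.
sub value_quoted {
    my ($gene, $line) = @_;
    return $line =~ /^\Q$gene\E\t(.+)\n/i ? $1 : undef;
}

my $gene  = 'Bcl20|ENSMUSG00000000317';
my $line  = "Bcl20|ENSMUSG00000000317\t0.815796340254127\n";
my $other = "Bcl204|ENSMUSG00000000999\t1.5\n";   # made-up line for a different gene

# Unquoted: the pattern is read as "^Bcl20" OR "ENSMUSG...\t(.+)\n",
# so a line for a different gene can match through the first alternative.
print "unquoted matches other gene: ", value_unquoted($gene, $other), "\n";

# Quoted: only the exact gene name followed by a tab matches.
print "quoted value: ", value_quoted($gene, $line), "\n";
print "quoted matches other gene: ", defined value_quoted($gene, $other) ? 1 : 0, "\n";
```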

EDIT 1: Thank you Borodin. The output of your code is indeed a list of every gene name with all of its expression values in each tissue.

e.g. Bcl20|ENSMUSG00000000317,0.815796340254127,0.815796340245643

This is great, much better than I managed, thank you. Two additional things are needed.

1) If a gene name is not found in a .txt file, then a value of 0 should be recorded for that tissue

e.g. Ht4|ENSMUSG00000000031,4.75878049632381,0

2) I need a comma-separated header row so that each value stays associated with the tissue it came from (basically a table). The tissue is the name of the text file, without the .txt extension

e.g. From 2 files heart.txt and liver.txt the first row should be:

genename|id,heart,liver

where genename|id is always the first header
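To make the two requirements concrete, here is a minimal sketch of the table I am after, not Borodin's code. It assumes the rows of each file have already been read into memory as "gene<TAB>value" lines keyed by tissue name; `merge_tissues` and the `%files` sample are hypothetical, and the header line in each file is assumed to start with `gene_name`.

```perl
use strict;
use warnings;

# Hypothetical helper: $files maps a tissue name to an array ref of
# "gene\tvalue" lines (a stand-in for the contents of heart.txt, liver.txt, ...).
sub merge_tissues {
    my ($files) = @_;
    my @tissues = sort keys %$files;
    my ( %value, @genes );

    for my $tissue (@tissues) {
        for my $line ( @{ $files->{$tissue} } ) {
            chomp( my $copy = $line );
            my ( $gene, $expr ) = split /\t/, $copy;
            next unless defined $expr;
            next if $gene eq 'gene_name';    # assumed header row, skipped
            # remember genes in first-seen order
            push @genes, $gene unless exists $value{$gene};
            $value{$gene}{$tissue} = $expr;
        }
    }

    # header row first, then one row per gene, with 0 for missing tissues
    my @rows = ( join ',', 'genename|id', @tissues );
    for my $gene (@genes) {
        push @rows, join ',', $gene, map { $value{$gene}{$_} // 0 } @tissues;
    }
    return @rows;
}

my %files = (
    heart => [
        "gene_name\texpression_value\n",
        "Ht4|ENSMUSG00000000031\t4.75878049632381\n",
        "Bcl20|ENSMUSG00000000317\t0.815796340254127\n",
    ],
    liver => [
        "gene_name\texpression_value\n",
        "Bcl20|ENSMUSG00000000317\t0.815796340245643\n",
    ],
);

print "$_\n" for merge_tissues( \%files );
```

Keeping the values in a hash keyed by gene and then tissue means each file only has to be read once, instead of being searched again for every gene.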
