Reputation: 5
I would like to merge two tab-delimited text files that share one common column. I have an 'identifier file' that looks like this (2 columns by 1050 rows):
module 1 gene 1
module 1 gene 2
..
module x gene y
I also have a tab-delimited 'target' text file that looks like this (36 columns by 12000 rows):
gene 1 sample 1 sample 2 etc
gene 2 sample 1 sample 2 etc
..
gene z sample 1 sample 2 etc
I would like to merge the two files based on the gene identifier and have both the matching expression values and module affiliations from the identifier and target files. Essentially to take the genes from the identifier file, find them in the target file and create a new file with module #, gene # and expression values all in one file. Any suggestions would be welcome.
My desired output is gene ID tab module affiliation tab sample values separated by tabs.
Here is the script I came up with. The script written does not produce any error messages but it gives me an empty file.
import csv

expression_values = {}
matches = []
with open("identifiers.txt") as ids, open("target.txt") as target:
    for line in target:
        expression_values = {line.split()[0]: line.split()}
    for line in ids:
        block_idents = line.split()
        for gene in expression_values.iterkeys():
            if gene == block_idents[1]:
                matches.append(block_idents[0] + block_idents[1] + expression_values)

csvfile = "modules.csv"
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in matches:
        writer.writerow([val])
Thanks!
Upvotes: 0
Views: 4206
Reputation: 12092
These lines of code are not doing what you are expecting them to do:

for line in target:
    expression_values = {line.split()[0]: line.split()}
for line in ids:
    block_idents = line.split()
    for gene in expression_values.iterkeys():
        if gene == block_idents[1]:
            matches.append(block_idents[0] + block_idents[1] + expression_values)
expression_values and block_idents only ever hold the values from the line currently being processed: each assignment replaces the previous contents, so the dictionary and the list are not "growing" as more lines are read. By the end of the first loop, expression_values contains only the last line of the target file. Also, TSV files can be parsed with less effort using the csv module.
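To see the difference concretely, here is a minimal sketch with hypothetical inline data: rebinding the dictionary on every iteration (as the original script does) keeps only the last line, while assigning into the same dictionary accumulates all lines.

```python
lines = ["geneA\t1.0\t2.0", "geneB\t3.0\t4.0"]  # stand-in for target.txt

# Rebinding: the name expression_values points to a brand-new
# one-entry dict on every iteration, so only the last line survives.
rebound = {}
for line in lines:
    rebound = {line.split()[0]: line.split()}

# Updating in place: entries accumulate across iterations.
updated = {}
for line in lines:
    fields = line.split()
    updated[fields[0]] = fields

print(len(rebound))  # 1 -- only geneB remains
print(len(updated))  # 2 -- both genes kept
```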
Here is a rough solution; it assumes the identifier file has two tab-separated columns, module then gene ID, and that a gene may belong to more than one module.

First construct a map of the data in the first file:
import csv
from collections import defaultdict

gene_map = defaultdict(list)

with open(first_file, 'rb') as file_one:
    csv_reader = csv.reader(file_one, delimiter='\t')
    for row in csv_reader:
        gene_map[row[1]].append(row[0])
Then read the second file and write to the output file as you go:
with open(sec_file, 'rb') as file_two, open(op_file, 'w') as out_file:
    csv_reader = csv.reader(file_two, delimiter='\t')
    csv_writer = csv.writer(out_file, delimiter='\t')
    for row in csv_reader:
        values = gene_map.get(row[0], [])
        op_list = []
        op_list.append(row[0])
        op_list.extend(values)
        op_list.extend(row[1:])  # fixed: extending values instead would leave the samples out of op_list
        csv_writer.writerow(op_list)
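For reference, here is the same approach as a runnable end-to-end sketch (Python 3, with tiny hypothetical inline data standing in for the real files):

```python
import csv
import io
from collections import defaultdict

# Two-row stand-ins for the identifier and target files.
first_file = io.StringIO("module1\tgene1\nmodule2\tgene2\n")
sec_file = io.StringIO("gene1\t0.5\t1.5\ngene2\t2.5\t3.5\n")
out_file = io.StringIO()

# Map each gene ID to the module(s) it belongs to.
gene_map = defaultdict(list)
for row in csv.reader(first_file, delimiter='\t'):
    gene_map[row[1]].append(row[0])

# Stream the target file, inserting module names after the gene ID.
writer = csv.writer(out_file, delimiter='\t', lineterminator='\n')
for row in csv.reader(sec_file, delimiter='\t'):
    op_list = [row[0]]
    op_list.extend(gene_map.get(row[0], []))  # [] if the gene has no module
    op_list.extend(row[1:])                   # append the sample values
    writer.writerow(op_list)

print(out_file.getvalue())
# gene1	module1	0.5	1.5
# gene2	module2	2.5	3.5
```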
Upvotes: 1
Reputation: 865
There are a number of problems with the existing approach, not least of which is that you are throwing away all the data from the files except for the last line in each: the assignment inside each "for line in ..." loop replaces the variable's contents, so only the last assignment, for the last line, takes effect.
Assuming each gene appears in only one module, I suggest you instead read the identifier file into a dictionary, saving the module for each gene ID:
geneMod = {}
for line in ids:
    fields = line.split()
    geneMod[fields[1]] = fields[0]  # key on the gene ID (column 2), value is the module (column 1)
Then you can just go through the target lines; for each line, split it, get the gene ID with gene = targetsplit[0], and save (or output) the same split fields with the module value inserted, e.g.: print gene, geneMod[gene], targetsplit[1:]
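A minimal runnable sketch of that approach (Python 3, with tiny hypothetical input files written first so the example is self-contained; the file names and sample values are made up):

```python
# Tiny stand-ins for identifiers.txt (module, gene) and target.txt (gene, samples).
with open("identifiers.txt", "w") as f:
    f.write("module1\tgene1\nmodule2\tgene2\n")
with open("target.txt", "w") as f:
    f.write("gene1\t0.5\t1.5\ngene3\t9.9\t9.9\n")

# Pass 1: build the gene -> module lookup.
geneMod = {}
with open("identifiers.txt") as ids:
    for line in ids:
        module, gene = line.split()
        geneMod[gene] = module

# Pass 2: stream the target file, inserting the module after the gene ID.
with open("target.txt") as target, open("merged.txt", "w") as out:
    for line in target:
        fields = line.split()
        gene = fields[0]
        if gene in geneMod:  # skip genes that belong to no module
            out.write("\t".join([gene, geneMod[gene]] + fields[1:]) + "\n")

with open("merged.txt") as f:
    print(f.read())
# gene1	module1	0.5	1.5
```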
Upvotes: 1