Reputation: 349
I have two files and I am trying to extract some values from file 1, like this:
File1:
2 word1
4 word2
4 word2_1
4 word2_2
8 word5
8 word5_3
File 2:
4
8
What I want is to extract every line that starts with 4 or 8 (the values listed in file 2), and there are a lot of them. Usually, when only one line matches each key, I would use a Python dictionary: one key, one element, easy! But now that several elements match the same key, my script only extracts the last one (obviously, since each new match overwrites the previous one). I understand that this is not how it should work, but I have no idea what to do instead, and I would be very happy if someone could help me get started.
Here is my "usual" code:
gene_count = {}
my_file = open('file1.txt')
for line in my_file:
    columns = line.strip().split()
    gene = columns[0]
    count = columns[1:13]
    gene_count[gene] = count
names_file = open('file2.txt')
output_file = open('output.txt', 'w')
for line in names_file:
    gene = line.strip()
    count = gene_count[gene]
    output_file.write('{0}\t{1}\n'.format(gene,"\t".join(count)))
output_file.close()
Upvotes: 3
Views: 1592
Reputation: 440
Have you considered using pandas? You can load the files into a DataFrame and then filter them:
In [5]: file1 = pn.read_csv('file1',sep=' ',
names=['number','word'],
engine='python')
In [6]: file1
Out[6]:
number word
0 2 word1
1 4 word2
2 4 word2_1
3 4 word2_2
4 8 word5
5 8 word5_3
In [9]: file1[(file1.number==4) | (file1.number==8)]
Out[9]:
number word
1 4 word2
2 4 word2_1
3 4 word2_2
4 8 word5
5 8 word5_3
In [13]: foo = file1[(file1.number==4) | (file1.number==8)].append(file2[(file2.number==4) | (file2.number==8)])
Out[13]:
number word
1 4 word2
2 4 word2_1
3 4 word2_2
4 8 word5
5 8 word5_3
1 4 word2
2 4 word2_1
3 4 word2_2
4 8 word5
5 8 word5_3
In [5] you read the file, in [9] you filter it by the values in the number column, and in [13] you join two filtered frames together.
You can then sort the result and do your computation much more easily than with a dictionary.
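Note that DataFrame.append, used in step [13], was removed in pandas 2.0; on a current install, pd.concat should do the same job. A minimal sketch of the join and sort steps, using two small made-up frames (first and second are hypothetical names, not part of the session above):

```python
import pandas as pd

# Two small hypothetical frames standing in for two filtered files.
first = pd.DataFrame({"number": [4, 8], "word": ["word2", "word5"]})
second = pd.DataFrame({"number": [4, 8], "word": ["word2_1", "word5_3"]})

# pd.concat stacks the rows; ignore_index=True renumbers them from 0.
combined = pd.concat([first, second], ignore_index=True)

# Sort by key so that rows sharing a number sit next to each other.
combined = combined.sort_values(["number", "word"]).reset_index(drop=True)
print(combined)
```

After sorting, all rows for key 4 come first, followed by all rows for key 8.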
UPDATE
To filter a pandas.DataFrame on the condition that a column value is in some list, you can use isin, passing it a list or a range, for example.
In [46]: file1[file1.number.isin([1,2,3])]
Out[46]:
number word
0 2 word1
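Applied to the question, the list passed to isin can come straight from file 2. A minimal sketch, assuming the sample data from the question; the data is inlined through io.StringIO so it runs without files, and the names keys and matches are only illustrative:

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch runs without files.
file1_text = "2 word1\n4 word2\n4 word2_1\n4 word2_2\n8 word5\n8 word5_3\n"
file2_text = "4\n8\n"

file1 = pd.read_csv(io.StringIO(file1_text), sep=" ", names=["number", "word"])
keys = pd.read_csv(io.StringIO(file2_text), names=["number"])

# Keep every row of file1 whose number appears in file 2.
matches = file1[file1["number"].isin(keys["number"])]
print(matches.to_string(index=False))
```

With real files, the io.StringIO objects would be replaced by the file paths.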
Upvotes: 1
Reputation: 9039
Make the values of your dictionary lists, and append to them.
In general:
from collections import defaultdict
# Missing keys start out as an empty list, so append always works.
my_dict = defaultdict(list)
for x in range(101):
    if x % 2 == 0:
        my_dict['evens'].append(str(x))
    else:
        my_dict['odds'].append(str(x))
print('evens:', ' '.join(my_dict['evens']))
print('odds:', ' '.join(my_dict['odds']))
In your case, your values are lists, so add (concatenate) the lists to the lists of your dictionary:
from collections import defaultdict
# Each gene maps to a list that grows with every matching line.
gene_count = defaultdict(list)
my_file = open('file1.txt')
for line in my_file:
    columns = line.strip().split()
    gene = columns[0]
    count = columns[1:13]
    # Extend the existing list instead of overwriting it.
    gene_count[gene] += count
my_file.close()
names_file = open('file2.txt')
output_file = open('output.txt', 'w')
for line in names_file:
    gene = line.strip()
    count = gene_count[gene]
    output_file.write('{0}\t{1}\n'.format(gene,"\t".join(count)))
names_file.close()
output_file.close()
If what you actually want to print is the count for each gene, then replace "\t".join(count) with len(count), the length of the list.
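To see the grouping at work without creating any files, here is a minimal sketch that runs the same idea on the sample data from the question. The lists file1_lines, file2_lines and output_lines are stand-ins introduced for this illustration, not part of the original script:

```python
from collections import defaultdict

# Sample data from the question, inlined so the sketch runs without files.
file1_lines = ["2 word1", "4 word2", "4 word2_1", "4 word2_2", "8 word5", "8 word5_3"]
file2_lines = ["4", "8"]

# Group every value under its key instead of overwriting it.
gene_count = defaultdict(list)
for line in file1_lines:
    columns = line.split()
    gene_count[columns[0]].extend(columns[1:])

# Look up each key from file 2; .get avoids creating empty
# entries in the defaultdict for keys that never appeared.
output_lines = []
for gene in file2_lines:
    values = gene_count.get(gene, [])
    output_lines.append("{0}\t{1}".format(gene, "\t".join(values)))

print("\n".join(output_lines))
```

Each output line holds a key from file 2 followed by all of its values, tab-separated, so key 4 carries word2, word2_1 and word2_2 on a single line.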
Upvotes: 1