user3188922

Reputation: 349

Extract: Python dictionary, key with multiple values

I have two files and I am trying to extract some values from file 1, like this:

File1:
2    word1
4    word2
4    word2_1
4    word2_2
8    word5
8    word5_3

File 2:
4
8

What I want is to extract every line starting with 4 or 8 (the values listed in file 2), and there are a lot of them. Usually, if only one line matched each value, I would use a Python dictionary: one key, one element, easy! But now that multiple elements match the same key, my script only extracts the last one (obviously, each new line overwrites the previous one as it goes along). I understand this is not how a dictionary works, but I don't know what to do instead and would be very happy if someone could help me get started.

Here is my "usual" code:

gene_count = {}
my_file = open('file1.txt')
for line in my_file:
    columns = line.strip().split()
    gene = columns[0]
    count = columns[1:13]
    gene_count[gene] = count

names_file = open('file2.txt')
output_file = open('output.txt', 'w')

for line in names_file:
    gene = line.strip()
    count = gene_count[gene]
    output_file.write('{0}\t{1}\n'.format(gene,"\t".join(count)))

output_file.close()

Upvotes: 3

Views: 1592

Answers (2)

Pawel Wisniewski

Reputation: 440

Have you considered using pandas? You can load each file into a DataFrame and then filter it:

In [5]: file1 = pn.read_csv('file1',sep='    ', 
                            names=['number','word'],
                            engine='python')

In [6]: file1
Out[6]: 
   number     word
0       2    word1
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3

In [9]: file1[(file1.number==4) | (file1.number==8)]
Out[9]: 
   number     word
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3

In [13]: pn.concat([file1[(file1.number==4) | (file1.number==8)], file2[(file2.number==4) | (file2.number==8)]])
Out[13]: 
   number     word
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3
1       4    word2
2       4  word2_1
3       4  word2_2
4       8    word5
5       8  word5_3

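In [5] you read the file, in [9] you filter its rows by the value in the number column, and in [13] you join two filtered DataFrames together (file2 there stands for a second DataFrame loaded the same way as file1).
You can then sort the result and do your computation much more easily than with a dictionary.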

UPDATE
To filter a pandas.DataFrame on the condition that a column value is in some list, use isin and pass it a list or, for example, a range.

In [46]: file1[file1.number.isin([1,2,3])]
Out[46]: 
   number   word
0       2  word1
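
To tie isin back to the question, here is a minimal sketch that takes the wanted numbers from file 2 instead of hard-coding them. The in-memory strings are stand-ins for the two files, the pd alias and the whitespace regex separator are my own choices rather than what the session above used, and wanted and matches are illustrative names.

```python
import io

import pandas as pd

# In-memory stand-ins for file1 and file2 from the question.
file1_text = (
    "2    word1\n"
    "4    word2\n"
    "4    word2_1\n"
    "4    word2_2\n"
    "8    word5\n"
    "8    word5_3\n"
)
file2_text = "4\n8\n"

# Split on any run of whitespace (an assumption about the file layout).
file1 = pd.read_csv(io.StringIO(file1_text), sep=r"\s+", names=["number", "word"])

# One wanted number per line in file 2.
wanted = [int(token) for token in file2_text.split()]

# Keep every row whose number appears in the wanted list.
matches = file1[file1["number"].isin(wanted)]
print(matches)
```

With real files you would pass the file paths to read_csv and read the wanted numbers from file 2 in the same way.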

Upvotes: 1

OregonTrail

Reputation: 9039

Make the values of your dictionary lists, and append to them.

In general:

from collections import defaultdict

# Missing keys start out as an empty list
my_dict = defaultdict(list)

for x in range(101):
    if x % 2 == 0:
        my_dict['evens'].append(str(x))
    else:
        my_dict['odds'].append(str(x))

print('evens:', ' '.join(my_dict['evens']))
print('odds:', ' '.join(my_dict['odds']))
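
If you would rather not import anything, a plain dict with setdefault gives the same grouping. This is only a sketch of that alternative, with groups as an illustrative name and a shorter range to keep it readable:

```python
# Plain dict: setdefault returns the existing list for the key,
# or stores and returns a new empty list the first time the key is seen.
groups = {}

for x in range(10):
    key = 'evens' if x % 2 == 0 else 'odds'
    groups.setdefault(key, []).append(str(x))

print('evens:', ' '.join(groups['evens']))
print('odds:', ' '.join(groups['odds']))
```

Unlike a defaultdict, looking up a missing key in groups afterwards still raises KeyError, which can be useful if you want typos in key names to fail loudly.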

In your case the values are already lists, so concatenate each one onto the list stored in your dictionary:

from collections import defaultdict

# Each gene maps to a list that grows as matching lines are read
gene_count = defaultdict(list)

with open('file1.txt') as my_file:
    for line in my_file:
        columns = line.strip().split()
        gene = columns[0]
        count = columns[1:13]
        # Concatenate instead of assigning, so earlier values are kept
        gene_count[gene] += count

with open('file2.txt') as names_file, open('output.txt', 'w') as output_file:
    for line in names_file:
        gene = line.strip()
        count = gene_count[gene]
        output_file.write('{0}\t{1}\n'.format(gene, "\t".join(count)))

If what you actually want to print is the count for each gene, then replace "\t".join(count) with len(count), the length of the list.
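
Note that concatenating puts all values for a key on one output line. If you instead want one output line per matching line of file 1, which is how I read "extract every line", a possible variation is to append each row as its own list. This is a sketch under that assumption: the in-memory lists stand in for the two files, and rows_by_key and output_lines are illustrative names.

```python
from collections import defaultdict

# In-memory stand-ins for the lines of file1.txt and file2.txt.
file1_lines = ["2    word1", "4    word2", "4    word2_1",
               "4    word2_2", "8    word5", "8    word5_3"]
file2_lines = ["4", "8"]

# Each key maps to a list of rows; each row is the list of remaining columns.
rows_by_key = defaultdict(list)
for line in file1_lines:
    columns = line.split()
    rows_by_key[columns[0]].append(columns[1:])

# Emit one tab-separated line per stored row, in file 2's key order.
output_lines = []
for line in file2_lines:
    key = line.strip()
    # .get avoids creating empty entries for keys that never appeared in file 1.
    for values in rows_by_key.get(key, []):
        output_lines.append("{0}\t{1}".format(key, "\t".join(values)))

print("\n".join(output_lines))
```

The only real difference from the code above is append instead of +=, which keeps the rows separate rather than merging them into one flat list.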

Upvotes: 1
