Vince

Reputation: 11

Retrieving the top value in a dictionary that has multiple values under a single key

I am somewhat new to Python and I have a problem. I have a file with 5 results for each unique identifier. Each result has a percent match and various other pieces of data. My goal is to find the result with the greatest percent match, and then retrieve more information from that original line. For example:

Name    Organism    Percent Match     Misc info
1        Human        100              xxx     
1        Goat          95              yyy
1        Pig           90              zzz   

I am attempting to solve this problem by building a dictionary keyed on name, where the values are the percent matches belonging to that name (i.e. multiple values for every key). The only way I can think to proceed is to convert the values in this dictionary to a list and then sort the list. I then want to retrieve the greatest value in the list (list[0] or list[-1]) and use it to retrieve more info from the original line. Here is my code thus far:

list = []  
if "1" in line: 
    id = line
    bsp = id.split("\t")
    uid = bsp[0]
    per = bsp[2]

    if not dict.has_key(uid):
        dict[uid] = []
    dict[uid].append(per)
    list = dict[uid]
    list.sort()
if list[0] in dict:
    print key

This ends up just printing every key, as opposed to only that which has the greatest percent. Any thoughts? Thanks!

Upvotes: 1

Views: 568

Answers (4)

unutbu

Reputation: 880359

You could use csv to parse the tab-delimited data file (though the data you posted looks to be column-spaced data!?).

Since the first line in your data file gives field names, a DictReader is convenient, so you can refer to the columns by human-readable names.

csv.DictReader returns an iterable of rows (dicts). If you take the max of the iterable using the Percent Match column as the key, you can find the row with the highest percent match:

Using this (tab-delimited) data as test.dat:

Name    Organism    Percent Match   Misc    info
1   Human   100 xxx
1   Goat    95  yyy
1   Pig 90  zzz
2   Mouse   95  yyy
2   Moose   90  zzz
2   Manatee 100 xxx

the program

import csv

maxrows = {}
with open('test.dat', 'rb') as f:
    for row in csv.DictReader(f, delimiter = '\t'):
        name = row['Name']
        percent = int(row['Percent Match'])
        if int(maxrows.get(name,row)['Percent Match']) <= percent:
            maxrows[name] = row

print(maxrows)

yields

{'1': {'info': None, 'Percent Match': '100', 'Misc': 'xxx', 'Organism': 'Human', 'Name': '1'}, '2': {'info': None, 'Percent Match': '100', 'Misc': 'xxx', 'Organism': 'Manatee', 'Name': '2'}}
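
The max-based idea described above can also be written out directly. This is only a minimal sketch, assuming Python 3 and the same tab-delimited layout. The sample is inlined through io.StringIO so it runs on its own (a real file would be opened with open('test.dat', newline='') instead), "Misc info" is treated as a single column, and best_rows is a hypothetical helper name, not part of csv:

```python
import csv
import io
from collections import defaultdict

# Inline copy of the tab-delimited sample (assumption: "Misc info" is one column).
DATA = (
    "Name\tOrganism\tPercent Match\tMisc info\n"
    "1\tHuman\t100\txxx\n"
    "1\tGoat\t95\tyyy\n"
    "1\tPig\t90\tzzz\n"
    "2\tMouse\t95\tyyy\n"
    "2\tMoose\t90\tzzz\n"
    "2\tManatee\t100\txxx\n"
)


def best_rows(stream):
    """Return {name: row with the highest 'Percent Match'} (hypothetical helper)."""
    groups = defaultdict(list)
    for row in csv.DictReader(stream, delimiter='\t'):
        groups[row['Name']].append(row)
    # max() compares rows by their integer percent match; on a tie it
    # returns the first such row it encountered.
    return {
        name: max(rows, key=lambda r: int(r['Percent Match']))
        for name, rows in groups.items()
    }


best = best_rows(io.StringIO(DATA))
for name, row in best.items():
    print(name, row['Organism'], row['Percent Match'], row['Misc info'])
```

Note that on a tie this keeps the first row seen, whereas the `<=` comparison in the program above keeps the last one.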

Upvotes: 2

Rik Poggi

Reputation: 29302

I think you may be looking for something like:

from collections import defaultdict

results = defaultdict(list)
with open('data.txt') as f:
    next(f)  # skip the header line; int('Percent') would raise ValueError below
    for line in f:
        splitted = line.split()
        results[splitted[0]].append(splitted[1:])

maxs = {}
for uid,data in results.items():
    maxs[uid] =  max(data, key=lambda k: int(k[1]))

I've tested it on a file like:

Name    Organism    Percent Match     Misc info
1        Human        100              xxx     
1        Goat          95              yyy
1        Pig           90              zzz   
2        Pig           85              zzz   
2        Goat          70              yyy

And the result was:

{'1': ['Human', '100', 'xxx'], '2': ['Pig', '85', 'zzz']}

Upvotes: 1

Andrew Clark

Reputation: 208555

You should be able to do something like this:

lines = []
with open('data.txt') as file:
    for line in file:
        if line.startswith('1'):
            lines.append(line.split())

best_match = max(lines, key=lambda k: int(k[2]))

After reading the file, lines would look something like this:

>>> pprint.pprint(lines)
[['1', 'Human', '100', 'xxx'],
 ['1', 'Goat', '95', 'yyy'],
 ['1', 'Pig', '90', 'zzz']]

And then you want to get the entry from lines where the int value of the third item is the highest, which can be expressed like this:

>>> max(lines, key=lambda k: int(k[2]))
['1', 'Human', '100', 'xxx']

So at the end, best_match will be a list with the data from the line you are interested in.

Or if you wanted to get really tricky, you could get the line in one (complicated) step:

with open('data.txt') as file:
    best_match = max((s.split() for s in file if s.startswith('1')),
                     key=lambda k: int(k[2]))
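
The snippets above handle only identifier 1 (and startswith('1') would also match an identifier such as 10). If the best row for every identifier is wanted, one possible extension is itertools.groupby. This is a rough sketch, assuming rows with the same identifier are adjacent in the file; the rows are inlined here in the shape that line.split() would produce, and best_by_id is a made-up name:

```python
from itertools import groupby

# Sample rows already split on whitespace, as line.split() would produce.
lines = [
    ['1', 'Human', '100', 'xxx'],
    ['1', 'Goat', '95', 'yyy'],
    ['1', 'Pig', '90', 'zzz'],
    ['2', 'Pig', '85', 'zzz'],
    ['2', 'Goat', '70', 'yyy'],
]

# groupby only merges adjacent items, so this assumes rows sharing an
# identifier sit next to each other (sort by k[0] first if they do not).
best_by_id = {
    uid: max(group, key=lambda k: int(k[2]))
    for uid, group in groupby(lines, key=lambda k: k[0])
}

for uid, row in best_by_id.items():
    print(uid, row)
```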

Upvotes: 1

Joel Cornett

Reputation: 24788

with open('datafile.txt', 'r') as f:
    lines = f.read().splitlines()  # use the handle f; splitlines() avoids a trailing empty string

matchDict = {}

for line in lines:
    if line.startswith('1'):  # unlike line[0], this is safe on empty lines
        uid, organism, percent, misc = line.split('\t')
        matchDict[int(percent)] = (organism, uid, misc)

highestMatch = max(matchDict.keys())

print('{0} is the highest match at {1} percent'.format(matchDict[highestMatch][0], highestMatch))
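
One caveat with keying the dictionary on the percentage: two rows with the same percentage overwrite each other. A small sketch of the effect, and of a variant that keeps whole rows and takes max over them instead. The "Cow" row is invented purely to create a tie, and by_percent is a hypothetical name:

```python
rows = [
    ('1', 'Human', '100', 'xxx'),
    ('1', 'Goat', '95', 'yyy'),
    ('1', 'Cow', '95', 'www'),  # hypothetical row that ties with Goat
]

# Dict keyed on percent: the second 95 silently replaces the first.
by_percent = {
    int(percent): (organism, uid, misc)
    for uid, organism, percent, misc in rows
}
print(len(by_percent))  # fewer entries than rows

# Keeping the rows in a list preserves every entry; max picks the best one.
uid, organism, percent, misc = max(rows, key=lambda r: int(r[2]))
print('{0} is the highest match at {1} percent'.format(organism, percent))
```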

Upvotes: 0
