Reputation: 11
I am somewhat new to Python and I have a problem. I have a file with 5 results for each unique identifier. Each result has a percent match and various other pieces of data. My goal is to find the result with the greatest percent match, and then retrieve more information from that original line. For example:
Name Organism Percent Match Misc info
1 Human 100 xxx
1 Goat 95 yyy
1 Pig 90 zzz
I am attempting to solve this problem by putting each name in a dictionary as a key, with the values being the percent matches belonging to that name (i.e. multiple values for every key). The only way I can think to proceed is to convert the values in this dictionary to a list, then sort the list. I then want to retrieve the greatest value in the list (list[0] or list[-1]) and then retrieve more info from the original line. Here is my code thus far:
list = []
if "1" in line:
    id = line
    bsp = id.split("\t")
    uid = bsp[0]
    per = bsp[2]
    if not dict.has_key(uid):
        dict[uid] = []
    dict[uid].append(per)
    list = dict[uid]
    list.sort()
    if list[0] in dict:
        print key
This ends up just printing every key, as opposed to only that which has the greatest percent. Any thoughts? Thanks!
Upvotes: 1
Views: 568
Reputation: 880359
You could use csv to parse the tab-delimited data file (though the data you posted looks like space-aligned columns rather than tab-separated data).
Since the first line in your data file gives field names, a DictReader is convenient, so you can refer to the columns by human-readable names.
csv.DictReader returns an iterable of rows (dicts). If you take the max of the iterable using the Percent Match column as the key, you can find the row with the highest percent match. To get one such row per Name, keep the best row seen so far for each name while iterating.
Using this (tab-delimited) data as test.dat:
Name Organism Percent Match Misc info
1 Human 100 xxx
1 Goat 95 yyy
1 Pig 90 zzz
2 Mouse 95 yyy
2 Moose 90 zzz
2 Manatee 100 xxx
the program
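import csv

maxrows = {}  # maps each Name to the best row seen so far
# 'rb' is the Python 2 idiom; on Python 3 use open('test.dat', newline='') instead
with open('test.dat', 'rb') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        name = row['Name']
        percent = int(row['Percent Match'])
        # Defaulting to the current row means the first row for a name is always stored
        if int(maxrows.get(name, row)['Percent Match']) <= percent:
            maxrows[name] = row
print(maxrows)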
yields
{'1': {'info': None, 'Percent Match': '100', 'Misc': 'xxx', 'Organism': 'Human', 'Name': '1'}, '2': {'info': None, 'Percent Match': '100', 'Misc': 'xxx', 'Organism': 'Manatee', 'Name': '2'}}
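To get back to the other fields of the winning line, index into the stored row by column name. Below is a minimal Python 3 sketch of the same idea, not a drop-in replacement: the SAMPLE string and the best_rows helper are hypothetical stand-ins for the file and the loop above, and it assumes "Misc info" is a single tab-separated column. It uses a strict > comparison, so on ties the first row wins (the program above uses <=, so there the last one wins).

```python
import csv
import io

# Hypothetical in-memory stand-in for test.dat (tab-delimited).
SAMPLE = (
    "Name\tOrganism\tPercent Match\tMisc info\n"
    "1\tHuman\t100\txxx\n"
    "1\tGoat\t95\tyyy\n"
    "1\tPig\t90\tzzz\n"
    "2\tMouse\t95\tyyy\n"
    "2\tMoose\t90\tzzz\n"
    "2\tManatee\t100\txxx\n"
)


def best_rows(lines):
    """Return {name: row} holding the highest 'Percent Match' row per name."""
    best = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        name = row["Name"]
        percent = int(row["Percent Match"])  # compare as numbers, not text
        if name not in best or percent > int(best[name]["Percent Match"]):
            best[name] = row
    return best


best = best_rows(io.StringIO(SAMPLE))
for name, row in sorted(best.items()):
    # Every column of the original line is still available by name.
    print(name, row["Organism"], row["Percent Match"], row["Misc info"])
```

With a real file, pass the object returned by open('test.dat', newline='') instead of the io.StringIO wrapper.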
Upvotes: 2
Reputation: 29302
I think you may be looking for something like:
from collections import defaultdict
results = defaultdict(list)
with open('data.txt') as f:
    next(f)  # skip the header line; remove this if your file has no header
    for line in f:
        splitted = line.split()
        results[splitted[0]].append(splitted[1:])
maxs = {}
for uid, data in results.items():
    maxs[uid] = max(data, key=lambda k: int(k[1]))
I've tested it on a file like:
Name Organism Percent Match Misc info
1 Human 100 xxx
1 Goat 95 yyy
1 Pig 90 zzz
2 Pig 85 zzz
2 Goat 70 yyy
And the result was:
{'1': ['Human', '100', 'xxx'], '2': ['Pig', '85', 'zzz']}
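The int() call in the key function matters. The values read from the file are strings, and strings compare character by character, which is also why sorting the raw percent strings (as in the question) would not put the largest number at either end. A small sketch with made-up values:

```python
percents = ["100", "95", "90"]  # values as read from the file: strings

as_text = max(percents)             # lexicographic comparison: '9' beats '1'
as_number = max(percents, key=int)  # numeric comparison

print(as_text)           # 95
print(as_number)         # 100
print(sorted(percents))  # ['100', '90', '95']
```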
Upvotes: 1
Reputation: 208555
You should be able to do something like this:
lines = []
with open('data.txt') as file:
    for line in file:
        if line.startswith('1'):
            lines.append(line.split())
best_match = max(lines, key=lambda k: int(k[2]))
After reading the file, lines would look something like this:
>>> pprint.pprint(lines)
[['1', 'Human', '100', 'xxx'],
 ['1', 'Goat', '95', 'yyy'],
 ['1', 'Pig', '90', 'zzz']]
And then you want to get the entry from lines where the int value of the third item is the highest, which can be expressed like this:
>>> max(lines, key=lambda k: int(k[2]))
['1', 'Human', '100', 'xxx']
So at the end of this, best_match will be a list with the data from the line you are interested in.
Or if you wanted to get really tricky, you could get the line in one (complicated) step:
with open('data.txt') as file:
    best_match = max((s.split() for s in file if s.startswith('1')),
                     key=lambda k: int(k[2]))
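If you need the best line for every identifier rather than just '1', one possible extension is itertools.groupby. This is only a sketch under two assumptions: the rows list below is made-up sample data standing in for the split lines of the file, and all lines for a given identifier sit next to each other (groupby only groups adjacent items).

```python
from itertools import groupby

# Hypothetical sample rows, already split; in practice they would come from
# (line.split() for line in file) after skipping the header.
rows = [
    ["1", "Human", "100", "xxx"],
    ["1", "Goat", "95", "yyy"],
    ["1", "Pig", "90", "zzz"],
    ["2", "Pig", "85", "zzz"],
    ["2", "Goat", "70", "yyy"],
]

# Group adjacent rows by identifier, then take the max of each group
# by the integer value of the percent column.
best_by_id = {
    uid: max(group, key=lambda r: int(r[2]))
    for uid, group in groupby(rows, key=lambda r: r[0])
}

print(best_by_id)
```

If the identifiers are not contiguous in the file, sort the rows by identifier first or collect them in a dictionary instead.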
Upvotes: 1
Reputation: 24788
with open('datafile.txt', 'r') as f:
    lines = f.read().split('\n')
matchDict = {}
for line in lines:
    # startswith() is safe on empty lines, unlike line[0]
    if line.startswith('1'):
        uid, organism, percent, misc = line.split('\t')
        # Keyed by percent, so rows with equal percentages overwrite each other
        matchDict[int(percent)] = (organism, uid, misc)
highestMatch = max(matchDict.keys())
print('{0} is the highest match at {1} percent'.format(matchDict[highestMatch][0], highestMatch))
Upvotes: 0