Reputation: 559
I have a space-delimited file with multiple lines that look like this:
A1BG P04217 VAR_018369 p.His52Arg Polymorphism rs893184 -
A1BG P04217 VAR_018370 p.His395Arg Polymorphism rs2241788 -
AAAS Q9NRG9 VAR_012804 p.Gln15Lys Disease - Achalasia
How do I build a dictionary keyed on the ID in the second column that stores the number (between the letters) from the fourth column?
I tried this, but it gives me an index out of range error:
lookup = defaultdict(list)
with open ('humsavar.txt', 'r') as humsavarTxt:
    for line in csv.reader(humsavarTxt):
        code = re.match('[a-z](\d+)[a-z]', line[1], re.I)
        if code:
            lookup[line[-2]].append(code.group(1))
print lookup['P04217']
Upvotes: 1
Views: 451
Reputation: 35522
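If you want a plain dictionary, this works:
import re
d={}
with open(your_file,'rb') as f:
    for line in f:
        l=line.split()   # split on any run of whitespace
        num=int(re.search(r'(\d+)',l[3]).group(1))   # first run of digits in column 4
        d.setdefault(l[1],[]).append(num)   # key on the ID in column 2
print d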
Prints:
{'P04217': [52, 395], 'Q9NRG9': [15]}
For a non-regex solution, you can also do this:
d={}
with open(your_file,'rb') as f:
    for line in f:
        els=line.split()
        num=int(''.join(c for c in els[3] if c.isdigit()))
        d.setdefault(els[1],[]).append(num)
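Both snippets above assume every line has at least four columns and that the fourth column always contains digits. A minimal sketch of a more defensive variant follows; the function name `build_lookup` and the in-memory `sample` list are hypothetical and only stand in for the real file:

```python
import re
from collections import defaultdict

def build_lookup(lines):
    """Map each ID (2nd column) to the numbers found in the 4th column."""
    lookup = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        match = re.search(r'\d+', fields[3])
        if match is None:
            continue  # skip lines whose 4th column has no digits
        lookup[fields[1]].append(int(match.group()))
    return dict(lookup)

# Sample rows taken from the question, plus one blank line
sample = [
    "A1BG P04217 VAR_018369 p.His52Arg Polymorphism rs893184 -",
    "A1BG P04217 VAR_018370 p.His395Arg Polymorphism rs2241788 -",
    "AAAS Q9NRG9 VAR_012804 p.Gln15Lys Disease - Achalasia",
    "",
]
result = build_lookup(sample)
```

Passing an open file object instead of `sample` should work the same way, since the function only iterates over lines.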
Upvotes: 0
Reputation: 353019
Here's a variant of the original code:
import csv, re
from collections import defaultdict
lookup = defaultdict(list)
with open('humsavar.txt', 'rb') as humsavarTxt:
    reader = csv.reader(humsavarTxt, delimiter=" ", skipinitialspace=True)
    for line in reader:
        code = re.search(r'(\d+)', line[3])
        lookup[line[1]].append(int(code.group(1)))
which produces
>>> lookup
defaultdict(<type 'list'>, {'P04217': [52, 395], 'Q9NRG9': [15]})
>>> lookup['P04217']
[52, 395]
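To see why the code in the question raised an index error, here is a small sketch (written for Python 3, using an in-memory string instead of the real file). With the default comma delimiter, `csv.reader` returns the whole line as a single field, so `line[1]` does not exist. In addition, `re.match` only matches at the start of the string, which is why `re.search` is used here.

```python
import csv
import io
import re

sample = "A1BG P04217 VAR_018369 p.His52Arg Polymorphism rs893184 -\n"

# Default delimiter is a comma: the whole line becomes one field
default_row = next(csv.reader(io.StringIO(sample)))

# Space delimiter: one field per column
space_row = next(csv.reader(io.StringIO(sample), delimiter=" ", skipinitialspace=True))

# re.match anchors at the start of the string; re.search scans the whole string
pattern = r'[a-z](\d+)[a-z]'
anchored = re.match(pattern, space_row[3], re.I)
scanned = re.search(pattern, space_row[3], re.I)
```

Here `default_row` has a single element, `anchored` is `None`, and `scanned.group(1)` holds the digits from the fourth column.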
Upvotes: 3
Reputation: 8824
If the ID and the number are always in the second and fourth columns, and the file is always space-delimited, you don't need a regular expression. You can split on the spaces instead:
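from collections import defaultdict
lookup = defaultdict(list)
with open ('humsavar.txt', 'r') as humsavarTxt:
    for line in humsavarTxt:
        fields = line.split(' ')   # split once and reuse the result
        # Note: this stores the whole fourth field (e.g. 'p.His52Arg'), not just the number
        lookup[fields[1]].append(fields[3])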
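If only the number is wanted, one regex-free option is to slice it out of the fourth field. This is a sketch under a strong assumption: every value has the exact form `p.` + three-letter residue + digits + three-letter residue, as in the sample lines. The helper name `position` is hypothetical.

```python
from collections import defaultdict

def position(change):
    # Assumes the form "p.XxxNNNXxx": drop the 5-character prefix and 3-character suffix
    return int(change[5:-3])

sample = [
    "A1BG P04217 VAR_018369 p.His52Arg Polymorphism rs893184 -",
    "A1BG P04217 VAR_018370 p.His395Arg Polymorphism rs2241788 -",
    "AAAS Q9NRG9 VAR_012804 p.Gln15Lys Disease - Achalasia",
]

lookup = defaultdict(list)
for line in sample:
    fields = line.split()
    lookup[fields[1]].append(position(fields[3]))
```

Any value that does not follow that pattern would raise a `ValueError` or yield a wrong number, so the regex versions are likely safer for a file whose format is not fully known.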
Upvotes: 1