Reputation: 463
I'm trying to split lines of text and store key information in a dictionary.
For example I have lines that look like:
Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
For the first line, my key will be "Lasal_00010", and the value I'm storing is "H293".
My current code works fine for this case, but when I encounter a line like:
Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
my code will not store the string "SSCG".
Here is my current code:
dataHash = {}
with open(fasta,'r') as f:
for ln in f:
query = ln.split('\t')[0]
query.strip()
tempValue = ln.split('\t')[1]
value = tempValue.split('|')[0]
value.strip()
if not dataHash.has_key(query):
dataHash[query] = ''
else:
dataHash[query] = value
for x in dataHash:
print x + " " + str(dataHash[x])
I believe I am splitting the line incorrectly in the case with two vertical bars. But I'm confused as to where my problem is. Shouldn't "SSCG" be the value I get when I write value = tempValue.split('|')[0]
? Can someone explain to me how split works or what I'm missing?
Upvotes: 0
Views: 340
Reputation: 1121972
Split on the first pipe, then on the space:
with open(fasta,'r') as f:
for ln in f:
query, value = ln.partition('|')[0].split()
I used str.partition()
here as you only need to split once.
Your code makes assumptions on where tabs are being used; by splitting on the first pipe first we get to ignore the rest of the line altogether, making it a lot simpler to split the first from the second column.
Demo:
>>> lines = '''\
... Lasal_00010 H293|H293_08936 42.37 321 164 8 27 344 37 339 7e-74 236
... Lasal_00010 SSEG|SSEG_00350 43.53 317 156 9 30 342 42 339 7e-74 240
... Lasal_00030 SSCG|pSCL4|SSCG_06461 27.06 218 83 6 37 230 35 200 5e-11 64.3
... '''
>>> for ln in lines.splitlines(True):
... query, value = ln.partition('|')[0].split()
... print query, value
...
Lasal_00010 H293
Lasal_00010 SSEG
Lasal_00030 SSCG
However, your code works too, up to a point, albeit less efficiently. Your real problem is with:
if not dataHash.has_key(query):
dataHash[query] = ''
else:
dataHash[query] = value
This really means: First time I see query
, store an empty string, otherwise store value
. I am not sure why you do this; if there are no other lines starting with Lasal_00030
, all you have is an empty value in the dictionary. If that wasn't the intention, just store the value:
dataHash[query] = value
No if
statement.
Note that dict.has_key()
has been deprecated; it is better to use in
to test for a key:
if query not in dataHash:
Upvotes: 4