Reputation: 2683
So I have a text table which looks like the following:
BLOCK 1. MARKERS: 1 2
42 (0.500) |0.269 0.166 0.041 0.024|
21 (0.351) |0.069 0.119 0.079 0.084|
22 (0.149) |0.054 0.040 0.055 0.000|
Multiallelic Dprime: 0.295
BLOCK 2. MARKERS: 9 10 11 12
1123 (0.392) |0.351 0.037|
2341 (0.324) |0.277 0.043|
2121 (0.176) |0.016 0.164|
1121 (0.108) |0.073 0.036|
Multiallelic Dprime: 0.591
BLOCK 3. MARKERS: 13 14
13 (0.716)
34 (0.284)
For each block, I only need the following information:
BLOCK1:
42 0.500
21 0.351
22 0.149
I don't have any problem parsing individual lines and extracting what I need; a list of lists is probably what I should end up with. My problem is that I cannot read the right number of lines for each block without getting an error at the end.
So I wrote this ugly code:
file = open('haplotypes_hetero.txt')
to_parse = []
for line in file:
    to_parse.append(line.strip())
to_parse_2=[]
for line in to_parse:
    line = line.split()
    to_parse_2.append(line)
for i in range(len(to_parse_2)):
    if to_parse_2[i][0]=='BLOCK':
        z=i
        if z < len(to_parse_2):
            z+=1
        while to_parse_2[z][0] != 'BLOCK':
            print to_parse_2[z][0]
            z+=1
            if z>len(to_parse_2):
                z=0
file.close()
It kind of works and prints what it is supposed to. However, I get an error at the end:
42
21
22
Multiallelic
1123
2341
2121
1121
Multiallelic
13
34
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
How do I get rid of the index error?
Upvotes: 0
Views: 117
Reputation: 103864
You can try something like this:
table='''\
BLOCK 1. MARKERS: 1 2
42 (0.500) |0.269 0.166 0.041 0.024|
21 (0.351) |0.069 0.119 0.079 0.084|
22 (0.149) |0.054 0.040 0.055 0.000|
Multiallelic Dprime: 0.295
BLOCK 2. MARKERS: 9 10 11 12
1123 (0.392) |0.351 0.037|
2341 (0.324) |0.277 0.043|
2121 (0.176) |0.016 0.164|
1121 (0.108) |0.073 0.036|
Multiallelic Dprime: 0.591
BLOCK 3. MARKERS: 13 14
13 (0.716)
34 (0.284)'''
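import re
d={}
# Each match is (block title, everything up to the next BLOCK line or the end of the text)
for title, block in re.findall(r'^(BLOCK \d+)\..*?\n(.*?)(?=^BLOCK|\Z)', table, flags=re.M | re.S):
    d[title]=[]
    for line in block.splitlines():
        # Keep the text before ')' and split it at '(' into the two fields
        t=line.partition(')')[0].partition('(')
        try:
            d[title].append(map(float, [t[0], t[2]]))
        except ValueError:
            # Lines such as 'Multiallelic Dprime: 0.295' are skipped
            pass
for k, v in d.items():
    print k,':',v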
Prints:
BLOCK 1 : [[42.0, 0.5], [21.0, 0.351], [22.0, 0.149]]
BLOCK 2 : [[1123.0, 0.392], [2341.0, 0.324], [2121.0, 0.176], [1121.0, 0.108]]
BLOCK 3 : [[13.0, 0.716], [34.0, 0.284]]
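To see why the two `partition` calls isolate the fields, here is a minimal sketch on single rows. The helper name `parse_row` is made up for illustration and is not part of the code above; it builds the list explicitly, so it behaves the same on Python 3, where `map` returns an iterator instead of a list.

```python
def parse_row(line):
    # Hypothetical helper, for illustration only.
    # 'X (Y) |...'.partition(')') -> ('X (Y', ')', ' |...'); keep the head.
    head = line.partition(')')[0]
    # 'X (Y'.partition('(') -> ('X ', '(', 'Y')
    name, _, freq = head.partition('(')
    try:
        # float() tolerates the trailing space left in 'X '
        return [float(name), float(freq)]
    except ValueError:
        # Rows without a 'number (number)' prefix, e.g. the Dprime line
        return None

row = parse_row('42 (0.500) |0.269 0.166 0.041 0.024|')
skipped = parse_row('Multiallelic Dprime: 0.295')
print(row)
print(skipped)
```

When the separator is missing, `partition` returns the whole string followed by two empty strings, which is what makes the `float` call fail and the row get skipped.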
Upvotes: 2
Reputation: 107287
You don't need anything complex for a problem like this; you can use a regex:
>>> s="""BLOCK 1. MARKERS: 1 2
... 42 (0.500) |0.269 0.166 0.041 0.024|
... 21 (0.351) |0.069 0.119 0.079 0.084|
... 22 (0.149) |0.054 0.040 0.055 0.000|
... Multiallelic Dprime: 0.295
... BLOCK 2. MARKERS: 9 10 11 12
... 1123 (0.392) |0.351 0.037|
... 2341 (0.324) |0.277 0.043|
... 2121 (0.176) |0.016 0.164|
... 1121 (0.108) |0.073 0.036|
... Multiallelic Dprime: 0.591
... BLOCK 3. MARKERS: 13 14
... 13 (0.716)
... 34 (0.284)"""
>>> import re
>>> l=re.findall(r'((^BLOCK \d+\.)((?!BLOCK).)*)(?=BLOCK|$)',s,re.MULTILINE|re.DOTALL)
>>> [(i[-2],re.findall(r'(\d+)\s+\(([\d.]+)\)',i[0])) for i in l]
[('BLOCK 1.', [('42', '0.500'), ('21', '0.351'), ('22', '0.149')]), ('BLOCK 2.', [('1123', '0.392'), ('2341', '0.324'), ('2121', '0.176'), ('1121', '0.108')]), ('BLOCK 3.', [('13', '0.716'), ('34', '0.284')])]
First you need to extract the blocks, which you can do with the following regex and re.findall:
>>> l=re.findall(r'((^BLOCK \d+\.)((?!BLOCK).)*)(?=BLOCK|$)',s,re.MULTILINE|re.DOTALL)
Then you can use r'(\d+)\s+\(([\d.]+)\)' to match a number followed by one or more whitespace characters and then a run of digits and dots inside parentheses.
As a side note, ((?!BLOCK).)* matches one character at a time for as long as the text at the current position does not start with the word BLOCK, so the match stops right before the next BLOCK. To read more about this, I suggest http://www.regular-expressions.info/lookaround.html, which explains lookaround in regular expressions.
Also, instead of a list comprehension you can use a dictionary comprehension:
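As a small illustration of that lookahead on its own, here is a sketch with a made-up three-block string (the sample text and variable names are assumptions for the example, not taken from the question's file):

```python
import re

# Made-up sample: three blocks with arbitrary content after each title
text = "BLOCK 1. a b\nBLOCK 2. c d\nBLOCK 3. e"

# (?:(?!BLOCK).)* consumes one character at a time, but only while the
# text at the current position does not start with 'BLOCK'.
# re.DOTALL lets '.' match the newlines inside a block.
pattern = re.compile(r'BLOCK \d+\.(?:(?!BLOCK).)*', re.DOTALL)

chunks = [m.strip() for m in pattern.findall(text)]
print(chunks)
```

Each element of `chunks` is one whole block, because the repetition stops as soon as the next `BLOCK` would begin.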
>>> {i[-2]:re.findall(r'(\d+)\s+\(([\d.]+)\)',i[0]) for i in l}
{'BLOCK 1.': [('42', '0.500'), ('21', '0.351'), ('22', '0.149')],
'BLOCK 2.': [('1123', '0.392'), ('2341', '0.324'), ('2121', '0.176'), ('1121', '0.108')],
'BLOCK 3.': [('13', '0.716'), ('34', '0.284')]}
Upvotes: 1
Reputation: 17451
Sorry, couldn't wait any longer..
>>> s='''BLOCK 1. MARKERS: 1 2
... 42 (0.500) |0.269 0.166 0.041 0.024|
... 21 (0.351) |0.069 0.119 0.079 0.084|
... 22 (0.149) |0.054 0.040 0.055 0.000|
... Multiallelic Dprime: 0.295
... BLOCK 2. MARKERS: 9 10 11 12
... 1123 (0.392) |0.351 0.037|
... 2341 (0.324) |0.277 0.043|
... 2121 (0.176) |0.016 0.164|
... 1121 (0.108) |0.073 0.036|
... Multiallelic Dprime: 0.591
... BLOCK 3. MARKERS: 13 14
... 13 (0.716)
... 34 (0.284)'''
>>> import re
>>> re.findall(r'(?:(\d+)\s+\(([\d.]+)\)|(BLOCK \d+))',s)
[('', '', 'BLOCK 1'), ('42', '0.500', ''), ('21', '0.351', ''), ('22', '0.149', ''), ('', '', 'BLOCK 2'), ('1123', '0.392', ''), ('2341', '0.324', ''), ('2121', '0.176', ''), ('1121', '0.108', ''), ('', '', 'BLOCK 3'), ('13', '0.716', ''), ('34', '0.284', '')]
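The result is one flat list, so it still has to be grouped per block. A minimal sketch of one way to do that, assuming a shortened copy of the table and variable names (`blocks`, `hap`, `freq`) chosen only for this example:

```python
import re

# Shortened copy of the table from the question
s = '''BLOCK 1. MARKERS: 1 2
42 (0.500) |0.269 0.166 0.041 0.024|
21 (0.351) |0.069 0.119 0.079 0.084|
Multiallelic Dprime: 0.295
BLOCK 2. MARKERS: 13 14
13 (0.716)
34 (0.284)'''

blocks = []  # list of (title, rows) pairs, in file order
for hap, freq, title in re.findall(r'(?:(\d+)\s+\(([\d.]+)\)|(BLOCK \d+))', s):
    if title:
        # A tuple with the third group filled starts a new block
        blocks.append((title, []))
    elif blocks:
        # Otherwise it is a data row belonging to the latest block
        blocks[-1][1].append((hap, float(freq)))
print(blocks)
```

The first column is kept as a string here and only the value in parentheses is converted to a float; whether that is the right choice depends on how the values are used afterwards.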
This:
file = open('haplotypes_hetero.txt')
to_parse = []
for line in file:
    to_parse.append(line.strip())
to_parse_2=[]
for line in to_parse:
    line = line.split()
    to_parse_2.append(line)
can be replaced with:
to_parse_2 = [ l.split() for l in open('haplotypes_hetero.txt').readlines() ]
I highly recommend learning Python's list comprehensions.
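A runnable sketch of the same idea, using an in-memory stand-in for the file (`io.StringIO` and the name `sample` are assumptions made so the example needs no file on disk):

```python
import io

# Stand-in for open('haplotypes_hetero.txt'); the two lines come from the question
sample = io.StringIO(
    u"BLOCK 1. MARKERS: 1 2\n"
    u"42 (0.500) |0.269 0.166 0.041 0.024|\n"
)

# split() with no argument already ignores leading and trailing whitespace
# (including the newline), so the separate strip() pass is not needed.
to_parse_2 = [line.split() for line in sample]
print(to_parse_2)
```

Iterating over the file object directly yields the same lines as `readlines()` without building the intermediate list first.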
Upvotes: 2
Reputation: 1015
I think the problem is with this check:
    if z>len(to_parse_2):
        z=0
It only fires when z is greater than the length of the list. But z == len(to_parse_2) is already one past the last valid index, so the next test of the while condition evaluates to_parse_2[z] and raises the IndexError before the check can help. Test for equality as well:
    if z >= len(to_parse_2):  # changed '>' to '>='
        z=0
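If you would rather avoid the index bookkeeping altogether, one option is a single pass that remembers the current block. This is only a sketch under my own assumptions: the function name `parse_blocks` is made up, titles are written as `BLOCK1` to match the output asked for in the question, and values are kept as strings.

```python
def parse_blocks(lines):
    """Hypothetical helper: collect (title, rows) pairs in file order."""
    blocks = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == 'BLOCK':
            # 'BLOCK 1. MARKERS: ...' -> title 'BLOCK1'
            blocks.append(('BLOCK' + fields[1].rstrip('.'), []))
        elif blocks and len(fields) > 1 and fields[1].startswith('('):
            # '42 (0.500) |...' -> ['42', '0.500']
            blocks[-1][1].append([fields[0], fields[1].strip('()')])
        # Anything else (e.g. 'Multiallelic Dprime: ...') is ignored
    return blocks

# Shortened copy of the table from the question
sample = """BLOCK 1. MARKERS: 1 2
42 (0.500) |0.269 0.166 0.041 0.024|
21 (0.351) |0.069 0.119 0.079 0.084|
22 (0.149) |0.054 0.040 0.055 0.000|
Multiallelic Dprime: 0.295
BLOCK 3. MARKERS: 13 14
13 (0.716)
34 (0.284)"""

result = parse_blocks(sample.splitlines())
for title, rows in result:
    print(title, rows)
```

Because the loop never looks ahead by index, there is nothing that can run past the end of the list.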
Upvotes: 3