Reputation: 2683

Parsing messed up text table in python

So I have a text table which looks like the following:

BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
22 (0.149)  |0.054  0.040   0.055   0.000|
Multiallelic Dprime: 0.295
BLOCK 2.  MARKERS: 9 10 11 12
1123 (0.392)    |0.351  0.037|
2341 (0.324)    |0.277  0.043|
2121 (0.176)    |0.016  0.164|
1121 (0.108)    |0.073  0.036|
Multiallelic Dprime: 0.591
BLOCK 3.  MARKERS: 13 14
13 (0.716)
34 (0.284)

For each block, I only need the following information:

BLOCK1:
42 0.500
21 0.351
22 0.149

I don't have any problem parsing individual lines and extracting what I need; a list of lists is probably my goal. My problem is that I cannot read the exact number of lines for each block without getting an error at the end.

So I wrote this ugly code:

file = open('haplotypes_hetero.txt')

to_parse = []

for line in file:
        to_parse.append(line.strip())

to_parse_2=[]

for line in to_parse:
        line = line.split()
        to_parse_2.append(line)

for i in range(len(to_parse_2)):
        if to_parse_2[i][0]=='BLOCK':
                z=i
                if z < len(to_parse_2):
                        z+=1
                while to_parse_2[z][0] != 'BLOCK':
                        print to_parse_2[z][0]
                        z+=1
                        if z>len(to_parse_2):
                                z=0


file.close()

It kind of works and prints what it is supposed to. However, I get an error at the end:

42
21
22
Multiallelic
1123
2341
2121
1121
Multiallelic
13
34
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)

How do I get rid of the index error?

Upvotes: 0

Answers (4)

dawg

Reputation: 103864

You can try something like this:

table='''\
BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166   0.041   0.024|
21 (0.351)  |0.069  0.119   0.079   0.084|
22 (0.149)  |0.054  0.040   0.055   0.000|
Multiallelic Dprime: 0.295
BLOCK 2.  MARKERS: 9 10 11 12
1123 (0.392)    |0.351  0.037|
2341 (0.324)    |0.277  0.043|
2121 (0.176)    |0.016  0.164|
1121 (0.108)    |0.073  0.036|
Multiallelic Dprime: 0.591
BLOCK 3.  MARKERS: 13 14
13 (0.716)
34 (0.284)'''

import re

d={}
for title, block in re.findall(r'^(BLOCK \d+)\..*?\n(.*?)(?=^BLOCK|\Z)', table, flags=re.M | re.S):
    d[title]=[]
    for line in block.splitlines():
        # "42 (0.500)  |..." -> ('42 ', '(', '0.500')
        t=line.partition(')')[0].partition('(')
        try:
            d[title].append([float(t[0]), float(t[2])])
        except ValueError:
            pass    # not a data line, e.g. "Multiallelic Dprime: ..."

for k, v in d.items():
    print k,':',v

Prints:

BLOCK 1 : [[42.0, 0.5], [21.0, 0.351], [22.0, 0.149]]
BLOCK 2 : [[1123.0, 0.392], [2341.0, 0.324], [2121.0, 0.176], [1121.0, 0.108]]
BLOCK 3 : [[13.0, 0.716], [34.0, 0.284]]

Upvotes: 2

Kasravnd

Reputation: 107287

You don't need anything complex for this kind of problem; you can use a regex:

>>> s="""BLOCK 1.  MARKERS: 1 2
... 42 (0.500)  |0.269  0.166   0.041   0.024|
... 21 (0.351)  |0.069  0.119   0.079   0.084|
... 22 (0.149)  |0.054  0.040   0.055   0.000|
... Multiallelic Dprime: 0.295
... BLOCK 2.  MARKERS: 9 10 11 12
... 1123 (0.392)    |0.351  0.037|
... 2341 (0.324)    |0.277  0.043|
... 2121 (0.176)    |0.016  0.164|
... 1121 (0.108)    |0.073  0.036|
... Multiallelic Dprime: 0.591
... BLOCK 3.  MARKERS: 13 14
... 13 (0.716)
... 34 (0.284)"""
>>> import re
>>> l=re.findall(r'((^BLOCK \d+\.)((?!BLOCK).)*)(?=BLOCK|$)',s,re.MULTILINE|re.DOTALL)
>>> [(i[-2],re.findall(r'(\d+)\s+\(([\d.]+)\)',i[0])) for i in l]
[('BLOCK 1.', [('42', '0.500'), ('21', '0.351'), ('22', '0.149')]), ('BLOCK 2.', [('1123', '0.392'), ('2341', '0.324'), ('2121', '0.176'), ('1121', '0.108')]), ('BLOCK 3.', [('13', '0.716'), ('34', '0.284')])]

First you need to extract the blocks, which you can do with the following regex and re.findall:

>>> l=re.findall(r'((^BLOCK \d+\.)((?!BLOCK).)*)(?=BLOCK|$)',s,re.MULTILINE|re.DOTALL)

Then you can use r'(\d+)\s+\(([\d.]+)\)' to match a number followed by one or more whitespace characters and then a run of digits and dots inside parentheses.

As a side note, ((?!BLOCK).)* matches any string that doesn't contain the word BLOCK. For more about look-arounds in regular expressions, I suggest reading http://www.regular-expressions.info/lookaround.html.
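
To illustrate that side note, here is a small sketch in Python 3 syntax. The string `text` is a made-up sample, not the original file, and the group is written as non-capturing so that findall returns whole matches. It shows that the pattern consumes one character at a time and stops just before the next occurrence of BLOCK:

```python
import re

# Hypothetical sample, much shorter than the table in the question.
text = "BLOCK 1. a b\nc d\nBLOCK 2. e f\n"

# (?:(?!BLOCK).)* advances one character at a time, and only while the
# text at the current position does not start with "BLOCK".
pattern = re.compile(r'BLOCK \d+\.(?:(?!BLOCK).)*', re.DOTALL)

chunks = pattern.findall(text)
for chunk in chunks:
    print(repr(chunk))
```

Each element of `chunks` starts at a block header and ends right before the next one, which is why the second findall can then be run on one block at a time.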

Also, instead of a list comprehension you can use a dictionary comprehension:

>>> {i[-2]:re.findall(r'(\d+)\s+\(([\d.]+)\)',i[0]) for i in l}

{'BLOCK 1.': [('42', '0.500'), ('21', '0.351'), ('22', '0.149')], 
 'BLOCK 2.': [('1123', '0.392'), ('2341', '0.324'), ('2121', '0.176'), ('1121', '0.108')], 
 'BLOCK 3.': [('13', '0.716'), ('34', '0.284')]}

Upvotes: 1

Kashyap

Reputation: 17451

Sorry, couldn't wait any longer..

>>> import re
>>> s='''BLOCK 1.  MARKERS: 1 2
... 42 (0.500)  |0.269  0.166   0.041   0.024|
... 21 (0.351)  |0.069  0.119   0.079   0.084|
... 22 (0.149)  |0.054  0.040   0.055   0.000|
... Multiallelic Dprime: 0.295
... BLOCK 2.  MARKERS: 9 10 11 12
... 1123 (0.392)    |0.351  0.037|
... 2341 (0.324)    |0.277  0.043|
... 2121 (0.176)    |0.016  0.164|
... 1121 (0.108)    |0.073  0.036|
... Multiallelic Dprime: 0.591
... BLOCK 3.  MARKERS: 13 14
... 13 (0.716)
... 34 (0.284)'''
>>> re.findall(r'(?:(\d+)\s+\(([\d.]+)\)|(BLOCK \d+))',s)
[('', '', 'BLOCK 1'), ('42', '0.500', ''), ('21', '0.351', ''), ('22', '0.149', ''), ('', '', 'BLOCK 2'), ('1123', '0.392', ''), ('2341', '0.324', ''), ('2121', '0.176', ''), ('1121', '0.108', ''), ('', '', 'BLOCK 3'), ('13', '0.716', ''), ('34', '0.284', '')]
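
The flat list of tuples above still has to be grouped per block. A minimal sketch of that step, assuming Python 3 and a shortened sample string (the names `sample`, `current` and `blocks` are illustrative and not from the question):

```python
import re

# Shortened sample in the same format as the question's table.
sample = """BLOCK 1.  MARKERS: 1 2
42 (0.500)  |0.269  0.166|
21 (0.351)  |0.069  0.119|
Multiallelic Dprime: 0.295
BLOCK 2.  MARKERS: 13 14
13 (0.716)
34 (0.284)"""

blocks = {}
current = None
for hap, freq, title in re.findall(r'(?:(\d+)\s+\(([\d.]+)\)|(BLOCK \d+))', sample):
    if title:
        # A header tuple looks like ('', '', 'BLOCK 1'): start a new block.
        current = title
        blocks[current] = []
    elif current is not None:
        # A data tuple looks like ('42', '0.500', ''): attach it to the
        # most recently seen block.
        blocks[current].append((hap, float(freq)))

print(blocks)
```

This relies on the header tuple always arriving before the data tuples of its block, which holds because findall returns matches in the order they appear in the text.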

This:

file = open('haplotypes_hetero.txt')

to_parse = []

for line in file:
        to_parse.append(line.strip())

to_parse_2=[]

for line in to_parse:
        line = line.split()
        to_parse_2.append(line)

can be replaced with:

with open('haplotypes_hetero.txt') as f:
    to_parse_2 = [l.split() for l in f]

I highly recommend learning Python's list comprehensions.

Upvotes: 2

Chiyaan Suraj

Reputation: 1015

I think the problem is with this check:

if z>len(to_parse_2):
      z=0

because it only resets z once z is greater than the length of the list. When z equals the length, the check lets it through, and the next evaluation of the while condition reads to_parse_2[z], which is already out of range and raises the IndexError. So change those lines to

if z >= len(to_parse_2):  # changed '>' to '>='
      z=0
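
Note that resetting z to 0 only ends the while loop because the first line of the file starts with BLOCK. A variant that avoids the index bookkeeping altogether is to walk the lines once and remember the current block. The following is a sketch of that idea in Python 3 syntax; `parse_blocks` and the inline `sample` are hypothetical names and data, not something from the question:

```python
def parse_blocks(lines):
    """Return a list of (block_name, [(haplotype, frequency), ...]) pairs."""
    blocks = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == 'BLOCK':
            # "BLOCK 1." -> "BLOCK1"
            blocks.append(('BLOCK' + fields[1].rstrip('.'), []))
        elif blocks and fields[0].isdigit() and len(fields) > 1:
            # "42 (0.500) |...|" -> ("42", 0.5)
            blocks[-1][1].append((fields[0], float(fields[1].strip('()'))))
    return blocks


sample = [
    'BLOCK 1.  MARKERS: 1 2',
    '42 (0.500)  |0.269  0.166   0.041   0.024|',
    'Multiallelic Dprime: 0.295',
    'BLOCK 3.  MARKERS: 13 14',
    '13 (0.716)',
    '34 (0.284)',
]
result = parse_blocks(sample)
print(result)
```

Because a file object yields its lines when iterated, the same function should also accept an open file, for example `with open('haplotypes_hetero.txt') as f: parse_blocks(f)`.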

Upvotes: 3
