Anon

Reputation: 875

Inconsistent string parsing in Python

I'm trying to parse strings in Python. I have posted a couple of questions on Stack Overflow about this, and I am now trying to combine the different parsing approaches into one function that handles every string format I am working with.

Here's a code snippet that works fine in isolation and parses the following two string formats.

from __future__ import generators
from pprint import pprint
s2="<one><two><three> an.attribute ::"
s1="< one > < two > < three > here's one attribute < six : 10.3 > < seven : 8.5 > <   eight :   90.1 > < nine : 8.7 >"
def parse(s):
    for t in s.split('<'):
        for u in t.strip().split('>',1):
            if u.strip(): yield u.strip()
pprint(list(parse(s1)))
pprint(list(parse(s2)))

Here's the output that I get. It's in the format that I need, with each attribute stored as a separate list element.

['one',
 'two',
 'three',
 "here's one attribute",
 'six : 10.3',
 'seven : 8.5',
 'eight : 90.1',
 'nine : 8.7']
['one', 'two', 'three', 'an.attribute ::']

After that was done, I tried to incorporate the same code into a function that can parse four string formats, but for some reason it doesn't work there and I can't figure out why.

Here's the incorporated code in its entirety.

from __future__ import generators
import re
import string
from pprint import pprint
temp=[]
y=[]
s2="< one > < two > < three > an.attribute ::"
s1="< one > < two > < three > here's an attribute < four : 6.5 > < five : 7.5 > < six : 8.5 > < seven : 9.5 >"
t2="< one > < two > < three > < four : 220.0 > < five : 6.5 > < six : 7.5 > < seven : 8.5 > < eight : 9.5 > < nine : 6 -  7 >"
t3="One : two :  three : four  Value  : five  Value  : six  Value : seven  Value :  eight  Value :"
def parse(s):
    c=s.count('<')
    print c
    if c==9:
        res = re.findall('< (.*?) >', s)
        return res
    elif (c==7|c==3):
        temp=parsing(s)
        pprint(list(temp))
        #pprint(list(parsing(s)))
    else: 
        res=s.split(' : ')
        res = [item.strip() for item in s.split(':')]
        return res
def parsing(s):
    for t in s.split(' < '):
        for u in t.strip().split('>',1):
            if u.strip(): yield u.strip()
    pprint(list((s)))

Now when I run the code and call parse(s1), I get the following output:

7
["< one > < two > < three > here's an attribute < four",
 '6.5 > < five',
 '7.5 > < six',
 '8.5 > < seven',
 '9.5 >']

Similarly, on calling parse(s2), I get:

3
['< one > < two > < three > an.attribute', '', '']

Why is the string split inconsistently when it is parsed? I'm using the same code in both places.

Could someone help me figure out why this is happening? :)

Upvotes: 0

Views: 128

Answers (1)

Martijn Pieters

Reputation: 1125398

You are using the | bitwise OR operator where you should be using the boolean or operator instead:

elif (c==7|c==3):

should be

elif c==7 or c==3:

or perhaps:

elif c in (3, 7):

which is faster to boot.
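As a rough sketch, here is how parse() might look with that test fixed. Two further changes are assumptions of mine rather than part of the fix above: the branch now returns the parsed list instead of only printing it, and parsing() splits on '<' as in your working snippet, because splitting on ' < ' leaves the leading "< one" intact. The Python 2 print statement and the stray pprint calls are dropped so the sketch runs on both Python 2 and 3.

```python
import re

def parsing(s):
    # Split on '<' without surrounding spaces so the first tag is handled too.
    for t in s.split('<'):
        for u in t.strip().split('>', 1):
            if u.strip():
                yield u.strip()

def parse(s):
    c = s.count('<')
    if c == 9:
        return re.findall('< (.*?) >', s)
    elif c in (3, 7):
        # Membership test replaces the broken (c==7|c==3).
        return list(parsing(s))
    else:
        return [item.strip() for item in s.split(':')]

s1 = ("< one > < two > < three > here's an attribute "
      "< four : 6.5 > < five : 7.5 > < six : 8.5 > < seven : 9.5 >")
s2 = "< one > < two > < three > an.attribute ::"

result1 = parse(s1)
result2 = parse(s2)
```

With these inputs, result2 should come out as ['one', 'two', 'three', 'an.attribute ::'], matching the output of your standalone snippet.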

Because | binds more tightly than the comparison operators (whereas or binds more loosely), the first statement was interpreted as the chained comparison (c == (7 | c) == 3), that is, c == (7 | c) and (7 | c) == 3. The bitwise result 7 | c is never going to be equal to both c and 3, so the test always returns False and parse() falls through to the else branch, which splits on ':' instead:

>>> c = 7
>>> (c==7|c==3)
False
>>> c = 3
>>> (c==7|c==3)
False
>>> c==7 or c==3
True

Upvotes: 2
