Reputation: 696
I need to extract all numbers from some free text as a list using pyparsing. Numbers will include scientific notation.
This is my grammar:
import pyparsing as pp
digits = '0123456789'
# straight number, e.g. 5, 10, 65535
strt_num = pp.Word(digits)
decimal = pp.Literal('.')
dec_num = strt_num+decimal+strt_num
multiply = pp.Literal('×')
minus = pp.Literal('−')
sci_num = (dec_num ^ strt_num)+multiply+'10'+pp.Optional(minus)+strt_num
num = sci_num ^ dec_num ^ strt_num
num.parseString('5 × 10−5 and then there is also 0.0001')
This gives me:
(['5', '\xc3\x97', '10', '\xe2\x88\x92', '5'], {})
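Which has two problems:
1. The number comes back as separate tokens instead of as a single string.
2. Only the first number is matched; the second one (0.0001) is not returned.
For problem 1, I tried to use the Combine class from the documentation, like this at the end: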
num = pp.Combine(sci_num ^ dec_num ^ strt_num)
but this stops matching the whole number for some reason and just gives me this:
(['5'], {})
For problem 2, I can't find anything in the documentation similar to "findall". The only option I see is to make n-grams (starting from 5-grams or so), check whether they match, and then make n smaller if not. The text between numbers can be anything (it's not something clean like a comma-separated list of numbers, as in other questions I've seen here).
But I feel like there must be a better way. Any help is appreciated. Thanks!
Upvotes: 1
Views: 495
Reputation: 63782
To have pyparsing do the string concatenation for you, change dec_num to:
dec_num = pp.Combine(strt_num+decimal+strt_num)
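As for why wrapping the whole of num in Combine returned only '5': by default Combine requires its tokens to be adjacent, with no whitespace between them, so the sci_num alternative fails at the space after the 5 and only strt_num matches. The sketch below contrasts that default with adjacent=False. It reuses the grammar from the question; the names strict and relaxed are made up for illustration.

```python
import pyparsing as pp

strt_num = pp.Word(pp.nums)
dec_num = strt_num + '.' + strt_num
sci_num = (dec_num ^ strt_num) + '×' + '10' + pp.Optional('−') + strt_num

# Default Combine: tokens must touch, so "5 × 10" cannot match as one number.
strict = pp.Combine(sci_num ^ dec_num ^ strt_num)
# adjacent=False: whitespace between tokens is allowed, tokens are still joined.
relaxed = pp.Combine(sci_num ^ dec_num ^ strt_num, adjacent=False)

text = '5 × 10−5 and then there is also 0.0001'
strict_result = strict.parseString(text).asList()
relaxed_result = relaxed.parseString(text).asList()
print(strict_result)
print(relaxed_result)
```

One caveat: adjacent=False also lets a decimal written with spaces, such as "0 . 0001", match as one number, which is why combining only dec_num, as above, may be the safer choice.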
Upvotes: 1
Reputation: 696
I just needed to use searchString instead of parseString (for problem 2) and call the asList() method to get a list of lists of matched tokens. Then I joined each inner list to get one string per number (for problem 1).
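A minimal sketch of that approach, assuming the grammar from the question with dec_num wrapped in Combine as suggested in the other answer. The variable names matches and numbers are only for illustration.

```python
import pyparsing as pp

strt_num = pp.Word(pp.nums)                      # 5, 10, 65535
dec_num = pp.Combine(strt_num + '.' + strt_num)  # 0.0001 as a single token
sci_num = (dec_num ^ strt_num) + '×' + '10' + pp.Optional('−') + strt_num
num = sci_num ^ dec_num ^ strt_num

text = '5 × 10−5 and then there is also 0.0001'

# searchString scans the whole text and returns one token group per match.
matches = num.searchString(text).asList()

# Join the tokens of each match into a single string.
numbers = [''.join(tokens) for tokens in matches]
print(numbers)  # ['5×10−5', '0.0001']
```

The join also removes the spaces that appeared between the tokens in the original text, because pyparsing skips whitespace between tokens by default.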
Upvotes: 1