AzureMinotaur

Reputation: 696

Getting all numbers as a list from a string with pyparsing

I need to extract all numbers from some free text as a list using pyparsing. Numbers will include scientific notation.

This is my grammar:

import pyparsing as pp

digits = '0123456789'
# straight number, e.g. 5, 10, 65535
strt_num = pp.Word(digits)
decimal = pp.Literal('.')
dec_num = strt_num+decimal+strt_num

multiply = pp.Literal('×')
minus = pp.Literal('−')

sci_num = (dec_num ^ strt_num)+multiply+'10'+pp.Optional(minus)+strt_num

num = sci_num ^ dec_num ^ strt_num

num.parseString('5 × 10−5 and then there is also 0.0001')

This gives me:

(['5', '\xc3\x97', '10', '\xe2\x88\x92', '5'], {})

Which has two problems:

  1. It gives me the number as different matched parts (instead of a single string)
  2. It only gives me the first matched number

For problem 1, I tried to use the Combine class from the documentation, wrapping the final expression like this:

num = pp.Combine(sci_num ^ dec_num ^ strt_num)

but this stops matching the whole number for some reason and just gives me this:

(['5'], {})

For problem 2, I can't find anything in the documentation similar to "findall". The only option I see is to build n-grams (starting from, say, 5-grams), check whether they match, and shrink n if they don't. The text between numbers can be anything (it's not something clean like a comma-separated list of numbers, as in other questions I've seen here).

But I feel like there must be a better way. Any help is appreciated. Thanks!

Upvotes: 1

Views: 495

Answers (2)

PaulMcG

Reputation: 63782

To have pyparsing do the string concatenation for you, change dec_num to:

dec_num = pp.Combine(strt_num+decimal+strt_num)
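As for why wrapping the whole expression in Combine returned only '5': by default Combine requires its tokens to be adjacent, with no whitespace between them, and '5 × 10−5' contains spaces, so only the plain integer alternative can match. A minimal sketch of one way around this, assuming you want the scientific form collapsed into a single string with the spaces dropped, is to pass adjacent=False for sci_num only (the variable names follow the question; the output format is my assumption about what is wanted):

```python
import pyparsing as pp

strt_num = pp.Word(pp.nums)

# Decimal numbers must be contiguous, so the default adjacent=True is right here.
dec_num = pp.Combine(strt_num + '.' + strt_num)

multiply = pp.Literal('×')
minus = pp.Literal('−')

# adjacent=False lets whitespace appear between the parts; the matched
# tokens are still joined into one string (without the whitespace).
sci_num = pp.Combine(
    (dec_num ^ strt_num) + multiply + '10' + pp.Optional(minus) + strt_num,
    adjacent=False,
)

num = sci_num ^ dec_num ^ strt_num

result = num.parseString('5 × 10−5 and then there is also 0.0001').asList()
print(result)
```

This still returns only the first number, because parseString stops once the grammar has matched at the start of the input.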

Upvotes: 1

AzureMinotaur

Reputation: 696

I just needed to use searchString instead of parseString (for problem 2) and call the asList() method on the result, which gives a list of lists (one inner list of matched tokens per number). Then I join each inner list to get a single string per number (for problem 1).
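Roughly, that looks like the sketch below. It reuses the grammar from the question unchanged, apart from pp.nums standing in for the hand-written digits string. Newer pyparsing releases also offer snake_case spellings (search_string, as_list); the camelCase names are the ones used in this thread.

```python
import pyparsing as pp

strt_num = pp.Word(pp.nums)
dec_num = strt_num + pp.Literal('.') + strt_num
multiply = pp.Literal('×')
minus = pp.Literal('−')
sci_num = (dec_num ^ strt_num) + multiply + '10' + pp.Optional(minus) + strt_num
num = sci_num ^ dec_num ^ strt_num

text = '5 × 10−5 and then there is also 0.0001'

# searchString scans the whole text and returns one group of tokens per match.
matches = num.searchString(text).asList()

# Join the pieces of each match into a single string.
numbers = [''.join(tokens) for tokens in matches]
print(numbers)
```

Note that the join uses an empty separator, so the spaces around '×' in the original text are not preserved.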

Upvotes: 1
