M4rk

Reputation: 2272

Match more than 2 spaces with PyParsing

I have a string like the following:

date                Not Important                         value    NotImportant2
11.11.13            useless . useless,21 useless 2        14.21    asmdakldm
21.12.12            fmpaosmfpoamsp 4                      41       ajfa9si90

I have to extract only the date and the value at the end.

If I use the standard procedure to match multiple words, pyparsing matches the last number in the "Not Important" column as the "value".

    anything = pp.Forward()
    anything << anyword + (value | anything)
    myParser = date + anything

I think the best way is to force pyparsing to match at least 2 whitespace characters, but I really don't know how. Any advice?

Upvotes: 1

Views: 409

Answers (2)

Ro Yo Mi

Reputation: 15000

Description

To match 2 or more whitespace characters you could use \s{2,}. Note that \s also matches tabs and line breaks, not only spaces.

This expression will:

  • capture the date field
  • capture the second to last field

^(\d{2}\.\d{2}\.\d{2})[^\r\n]*\s(\S+)\s{2,}\S+\s*(?:[\r\n]|\Z)
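For readers who want to try this from Python, here is a minimal sketch using the standard re module. It assumes the multiline flag is needed so that ^ anchors at the start of each line, and that real columns are always separated by at least two spaces; the variable names are only illustrative.

```python
import re

sample = """\
date                Not Important                         value    NotImportant2
11.11.13            useless . useless,21 useless 2        14.21    asmdakldm
21.12.12            fmpaosmfpoamsp 4                      41       ajfa9si90
"""

# re.MULTILINE lets ^ match at the start of every line, not only the string.
pattern = re.compile(
    r"^(\d{2}\.\d{2}\.\d{2})[^\r\n]*\s(\S+)\s{2,}\S+\s*(?:[\r\n]|\Z)",
    re.MULTILINE,
)

# findall returns one (date, value) tuple per matching line.
pairs = pattern.findall(sample)
print(pairs)

# Simpler alternative: split one row on runs of two or more whitespace characters.
first_row = sample.splitlines()[1]
columns = re.split(r"\s{2,}", first_row.strip())
print(columns[0], columns[-2])
```

The split variant only works while no cell contains two consecutive spaces, which holds for the sample but is an assumption about real data.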

Examples

Sample Text

date                Not Important                         value    NotImportant2
11.11.13            useless . useless,21 useless 2        14.21    asmdakldm
21.12.12            fmpaosmfpoamsp 4                      41       ajfa9si90

Matches

[0][0] = 11.11.13            useless . useless,21 useless 2        14.21    asmdakldm
[0][1] = 11.11.13
[0][2] = 14.21

[1][0] = 21.12.12            fmpaosmfpoamsp 4                      41       ajfa9si90
[1][1] = 21.12.12
[1][2] = 41

Upvotes: 2

PaulMcG

Reputation: 63729

This sample text is columnar, so pyparsing is somewhat overkill here. You could just write:

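# 0-based slices for the two columns of interest
# (sample is the list of lines defined further down)
fieldslices = [slice(0, 8),        # date column
               slice(58, 58 + 8),  # value column
              ]

for line in sample:
    date, value = (line[x] for x in fieldslices)
    print(date, value.strip())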

and get:

date     value
11.11.13 14.21
21.12.12 41
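If hard-coding the offsets feels brittle, one option is to derive them from the header line instead. This is only a sketch, assuming every row is padded to the same column positions as the header; the widths 20, 38 and 9 used to rebuild the sample are read off the text above and are not guaranteed for a real file.

```python
# Rebuild the sample with explicit column widths (assumed: 20, 38, 9).
rows = [
    ("date", "Not Important", "value", "NotImportant2"),
    ("11.11.13", "useless . useless,21 useless 2", "14.21", "asmdakldm"),
    ("21.12.12", "fmpaosmfpoamsp 4", "41", "ajfa9si90"),
]
sample = ["{:<20}{:<38}{:<9}{}".format(*row) for row in rows]

header = sample[0]
# A column runs from where its heading starts to where the next heading starts.
value_start = header.index("value")
value_end = header.index("NotImportant2")
fieldslices = [
    slice(0, header.index("Not Important")),  # date column
    slice(value_start, value_end),            # value column
]

# Skip the header row and strip the padding from each extracted field.
records = [tuple(line[s].strip() for s in fieldslices) for line in sample[1:]]
for date, value in records:
    print(date, value)
```

Here value_start comes out as 58, the same 0-based offset used in the hand-written slice.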

But since you specifically wanted a pyparsing solution, then for something so columny, you can use the GoToColumn class:

from pyparsing import *

dateExpr = Regex(r'(\d\d\.){2}\d\d').setName("date")
realNum = Regex(r'\d+\.\d*').setName("real").setParseAction(lambda t:float(t[0]))
intNum = Regex(r'\d+').setName("integer").setParseAction(lambda t:int(t[0]))
valueExpr = realNum | intNum

patt = dateExpr("date") + GoToColumn(59) + valueExpr("value")

GoToColumn is similar to SkipTo, but instead of advancing to the next instance of an expression, it just advances to a particular column number (where column numbers are 1-based, not 0-based like in string slicing).

Now here is that parser applied to your sample text:

# Normally, input would be from some text file
# infile = open(sourcefile)
# but for this example, create iterator from the sample 
# text instead
sample = """\
date                Not Important                         value    NotImportant2
11.11.13            useless . useless,21 useless 2        14.21    asmdakldm
21.12.12            fmpaosmfpoamsp 4                      41       ajfa9si90
""".splitlines()

infile = iter(sample)

# skip header line
next(infile) 

for line in infile:
    result = patt.parseString(line)
    print(result.dump())
    print()

Prints:

['11.11.13', 'useless . useless,21 useless 2        ', 14.210000000000001]
- date: 11.11.13
- value: 14.21

['21.12.12', 'fmpaosmfpoamsp 4                      ', 41]
- date: 21.12.12
- value: 41

Note that the values have already been converted from strings to int or float. You could write a similar parse action to convert your dd.mm.yy dates to Python datetimes. Also note how the results names are defined; they let you access the individual fields by name, as in print(result.date).
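The date conversion mentioned above could look something like the following. It is a sketch only: to_date is a hypothetical helper name, it is written as a plain function so it runs without pyparsing installed, and it assumes two-digit years should be interpreted the way strptime's %y does (13 becomes 2013).

```python
from datetime import date, datetime

def to_date(tokens):
    """Convert the first token, a dd.mm.yy string, into a datetime.date."""
    return datetime.strptime(tokens[0], "%d.%m.%y").date()

# With pyparsing this could be attached to the date expression
# (shown as a comment, not executed here):
#     dateExpr.setParseAction(to_date)
print(to_date(["11.11.13"]))
```

A parse action that returns a value replaces the matched token, so result.date would then hold a date object rather than a string.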

I also noticed that you used this construct to match a sequence of one or more elements:

anything = pp.Forward()
anything << anyword + (value | anything)

While this does work, it creates a runtime-costly recursive expression. pyparsing offers an iterative equivalent, OneOrMore:

anything = OneOrMore(anyword)

Or if you prefer the newer '*'-operator form:

anything = anyword*(1,)

Please take a scan through the pyparsing API docs, which are included in the source distribution of pyparsing, or online at http://packages.python.org/pyparsing/.

Welcome to Pyparsing!

Upvotes: 1
