Reputation: 792

Python Regular expression Lookahead overshooting pattern

I'm trying to pull out the data contained in an FTP LIST response.

I'm using regex within Python 2.7.

test = "-rw-r--r--   1 owner    group        75148624 Jan  6  2015 somename.csv-rw-r--r--   1 owner    group       223259072 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group         4041411 Jun  5  2015 somename-adjusted.csv-rw-r--r--   1 owner    group         2879228 May 13  2015 somename.csv-rw-r--r--   1 owner    group        11832668 Feb 13  2015 somename.csv-rw-r--r--   1 owner    group         1510522 Feb 19  2015 somename.csv-rw-r--r--   1 owner    group         2826664 Feb 25  2015 somename.csv-rw-r--r--   1 owner    group          582985 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group          212427 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group         3015592 Feb 27  2015 somename.csv-rw-r--r--   1 owner    group          103576 Feb 27  2015    somename-corrected.csv"

(now without code formatting so you can see it without scrolling)

test = "-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"

I've tried various incarnations of the following

from re import compile
ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
                      '(?P<links>[0-9]{1,8})[\s]{1,20}'
                      '(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      '(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      '(?P<size>[0-9]{1,16})[\s]{1,20}'
                      '(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
                      '(?P<date>[0-9]{1,2})[\s]{1,20}'
                      '(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
                      '(?P<filename>[\s\w\.\-]+)(?=[drwx\-]{10})')

with the last line replaced by either of

'(?P<filename>.+)(?=[drwx\-]{10})')

'(?P<filename>.+(?=[drwx\-]{10}))')

and originally,

'(?P<filename>[\s\w\.\-]+(?=[drwx\-]{10}|$))')

so that I can capture the last entry as well.

Regardless, I keep getting the following output:

ftp_list_re.findall(test)

[('-rw-r--r--',
  '1',
  'owner',
  'group',
  '75148624',
  'Jan',
  '6',
  '2015',
  'somename.csv-rw-r--r--   1 owner    group       223259072 Feb 26  2015     somename.csv-rw-r--r--   1 owner    group         4041411 Jun  5  2015 somename-adjusted.csv-rw-r--r--   1 owner    group         2879228 May 13  2015 somename.csv-rw-r--r--   1 owner    group        11832668 Feb 13  2015 somename.csv-rw-r--r--   1 owner    group         1510522 Feb 19  2015 somename.csv-rw-r--r--   1 owner    group         2826664 Feb 25  2015 somename.csv-rw-r--r--   1 owner    group          582985 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group          212427 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group         3015592 Feb 27  2015 somename.csv')]

What am I doing wrong?

Upvotes: 4

Answers (4)

Casimir et Hippolyte

Reputation: 89557

With the PyPI regex module, which allows splitting on an empty match, you can do the same thing more simply, without having to describe every field:

import regex

fields = ('permissions', 'links', 'owner', 'group', 'size', 'month', 'day', 'year', 'filename')
p = regex.compile(r'(?=[d-](?:[r-][w-][x-]){3})', regex.V1)
# maxsplit=8 yields at most 9 pieces (one per field), so a filename containing spaces stays whole
res = [dict(zip(fields, x.split(None, 8))) for x in p.split(test)[1:]]
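If installing a third-party module is not an option, roughly the same idea can be sketched with the standard `re` module: locate where each permission block starts, then slice between those positions. This is only an illustration. The helper name `split_records` and the two-entry `listing` are hypothetical, and the sketch assumes that no filename contains something that looks like a permission block.

```python
import re

fields = ('permissions', 'links', 'owner', 'group', 'size',
          'month', 'day', 'year', 'filename')

# Same shape as the lookahead above, but used to find where each record starts.
boundary = re.compile(r'[d-](?:[r-][w-][x-]){3}')


def split_records(listing):
    """Cut the listing into one string per entry (hypothetical helper)."""
    starts = [m.start() for m in boundary.finditer(listing)]
    ends = starts[1:] + [len(listing)]
    return [listing[s:e] for s, e in zip(starts, ends)]


# Two-entry excerpt of the listing, for illustration only.
listing = ("-rw-r--r--   1 owner    group        75148624 Jan  6  2015 somename.csv"
           "-rw-r--r--   1 owner    group          103576 Feb 27  2015 somename-corrected.csv")

# maxsplit=8 gives at most nine pieces, one per field name.
res = [dict(zip(fields, rec.split(None, 8))) for rec in split_records(listing)]

for entry in res:
    print(entry['filename'])
```

Slicing between start positions avoids relying on empty-match splitting, which the `re` module in Python 2.7 does not support.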

Upvotes: 0

Matthew

Reputation: 7590

Regular expression quantifiers are "greedy" by default, which means that they will "eat" as much as possible.

[\s\w\.\-]+

means: match at least one, AND AS MANY AS POSSIBLE, of whitespace, word, dot, or dash characters. The lookahead prevents it from eating the entire input (strictly speaking, the regex engine eats the entire input first and then backs off as needed), so it eats every file specification line except the last, which the lookahead insists must be left over.

Adding a ? after a quantifier (*?, +?, ??, and so on) makes the quantifier "lazy" or "reluctant". This changes the meaning of "+" from "match at least one and as many as possible" to "match at least one and no more than necessary".

Therefore changing that last + to a +? should fix your problem.

The problem wasn't with the lookahead, which works just fine, but with the last subexpression before it.
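To see the difference on a small scale, here is a minimal sketch. The `sample` string is a shortened, made-up stand-in for the listing in the question, and the pattern keeps only the year and filename parts of the original expression; both are assumptions for illustration.

```python
import re

# Three shortened entries glued together without separators (made-up data).
sample = ("2015 a.csv-rw-r--r-- 1 o g 5 Feb 26 2015 b.csv"
          "-rw-r--r-- 1 o g 7 Feb 27 2015 c.csv")

# Greedy: the filename class runs to the end of the input, then backs off
# to the LAST place where a permission block follows.
greedy = re.compile(r"[0-9:]{4,5}\s+([\s\w.-]+)(?=[drwx-]{10})")

# Lazy: the filename grows one character at a time and stops at the FIRST
# place where a permission block follows.
lazy = re.compile(r"[0-9:]{4,5}\s+([\s\w.-]+?)(?=[drwx-]{10})")

greedy_result = greedy.findall(sample)
lazy_result = lazy.findall(sample)

print(greedy_result)  # one oversized "filename" spanning two entries
print(lazy_result)    # the first two filenames; the last entry is still missed
```

The lazy version still skips `c.csv`, because nothing that looks like a permission block follows it.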

EDIT:

Even with this change, your regular expression will not parse the last file specification line. This is because the regular expression INSISTS that a permission spec must follow the filename. To fix this, we must allow the lookahead to find no permission spec after the last entry (while still requiring one after every other entry). The following change does that:

ftp_list_re = compile(r'(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
                      r'(?P<links>[0-9]{1,8})[\s]{1,20}'
                      r'(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      r'(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      r'(?P<size>[0-9]{1,16})[\s]{1,20}'
                      r'(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
                      r'(?P<date>[0-9]{1,2})[\s]{1,20}'
                      r'(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
                      r'(?P<filename>[\s\w\.\-]+?)(?=(?:(?:[drwx\-]{10})|$))')

What I have done here (besides making that last + lazy) is to make the lookahead check two possibilities: either a permission specification OR the end of the string. The ?: markers prevent those parentheses from capturing (otherwise you would end up with undesired extra data in your matches).
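A small sketch of what the `|$` alternative and the `?:` marker each change, using a made-up, shortened sample and only the tail of the pattern (both are assumptions for illustration, not the asker's data):

```python
import re

# Three shortened entries glued together without separators (made-up data).
sample = ("2015 a.csv-rw-r--r-- 1 o g 5 Feb 26 2015 b.csv"
          "-rw-r--r-- 1 o g 7 Feb 27 2015 c.csv")

# Lookahead that demands a permission block: the last entry cannot match.
no_end = re.compile(r"[0-9:]{4,5}\s+([\s\w.-]+?)(?=[drwx-]{10})")

# Lookahead that accepts a permission block OR the end of the string.
with_end = re.compile(r"[0-9:]{4,5}\s+([\s\w.-]+?)(?=(?:[drwx-]{10}|$))")

# Same idea, but with a capturing group inside the lookahead.
capturing = re.compile(r"[0-9:]{4,5}\s+([\s\w.-]+?)(?=([drwx-]{10})|$)")

without_end = no_end.findall(sample)
names = with_end.findall(sample)
pairs = capturing.findall(sample)

print(without_end)  # the last filename is missing
print(names)        # all three filenames
print(pairs)        # tuples: each filename plus whatever the lookahead captured
```

With the capturing variant, `findall` returns a tuple per match, which is the extra data the non-capturing form avoids.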

Upvotes: 2

Ferit

Reputation: 9657

I fixed your last line; the filename group was not working. See the fixed regex below:

(?P<permissions>[d-][rwx-]{9})[\s]{1,20}
                      (?P<links>[0-9]{1,8})[\s]{1,20}
                      (?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}
                      (?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}
                      (?P<size>[0-9]{1,16})[\s]{1,20}
                      (?P<month>[A-Za-z]{0,3})[\s]{1,20}
                      (?P<date>[0-9]{1,2})[\s]{1,20}
                      (?P<timeyear>[0-9:]{4,5})[\s]{1,20}
                      (?P<filename>[\w\-]+\.\w+)


Upvotes: 0

anubhava

Reputation: 785156

You should make the sub-pattern before the lookahead non-greedy. Your regex can also be shortened a bit, like this:

(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>\d{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>\d{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)

Or using compile:

from re import compile

ftp_list_re = compile(r'(?P<permissions>[d-][rwx-]{9})\s{1,20}'
   r'(?P<links>\d{1,8})\s{1,20}'
   r'(?P<owner>[\w-]{1,16})\s{1,20}'
   r'(?P<group>[\w-]{1,16})\s{1,20}'
   r'(?P<size>\d{1,16})\s{1,20}'
   r'(?P<month>[A-Za-z]{0,3})\s{1,20}'
   r'(?P<date>\d{1,2})\s{1,20}'
   r'(?P<timeyear>[\d:]{4,5})\s{1,20}'
   r'(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')


Code:

import re
p = re.compile(ur'(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>[0-9]{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>[0-9]{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')
test_str = u"-rw-r--r--   1 owner    group        75148624 Jan  6  2015 somename.csv-rw-r--r--   1 owner    group       223259072 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group         4041411 Jun  5  2015 somename-adjusted.csv-rw-r--r--   1 owner    group         2879228 May 13  2015 somename.csv-rw-r--r--   1 owner    group        11832668 Feb 13  2015 somename.csv-rw-r--r--   1 owner    group         1510522 Feb 19  2015 somename.csv-rw-r--r--   1 owner    group         2826664 Feb 25  2015 somename.csv-rw-r--r--   1 owner    group          582985 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group          212427 Feb 26  2015 somename.csv-rw-r--r--   1 owner    group         3015592 Feb 27  2015 somename.csv-rw-r--r--   1 owner    group          103576 Feb 27  2015 somename-corrected.csv"

re.findall(p, test_str)
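If dictionaries keyed by group name are more convenient than the tuples that `findall` returns, `finditer` together with `groupdict` is one option. A minimal sketch, assuming a two-entry excerpt of the listing and plain raw strings in place of the `ur` prefix so that it runs on both Python 2.7 and Python 3:

```python
import re

ftp_list_re = re.compile(
    r'(?P<permissions>[d-][rwx-]{9})\s{1,20}'
    r'(?P<links>\d{1,8})\s{1,20}'
    r'(?P<owner>[\w-]{1,16})\s{1,20}'
    r'(?P<group>[\w-]{1,16})\s{1,20}'
    r'(?P<size>\d{1,16})\s{1,20}'
    r'(?P<month>[A-Za-z]{0,3})\s{1,20}'
    r'(?P<date>\d{1,2})\s{1,20}'
    r'(?P<timeyear>[\d:]{4,5})\s{1,20}'
    r'(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')

# Two-entry excerpt of the listing, for illustration only.
listing = ("-rw-r--r--   1 owner    group        75148624 Jan  6  2015 somename.csv"
           "-rw-r--r--   1 owner    group          103576 Feb 27  2015 somename-corrected.csv")

# One dict per entry, keyed by the named groups of the pattern.
entries = [m.groupdict() for m in ftp_list_re.finditer(listing)]

for entry in entries:
    print("%s (%s bytes)" % (entry['filename'], entry['size']))
```

Each dictionary then carries the field names directly, so later code does not depend on tuple positions.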

Upvotes: 2
