BenH

Reputation: 720

Having trouble with re and matching groups

I'm pulling my hair out with Python regexps.

I have a string which contains the multi-line output from an OS command.

One such line will contain a string like the following:

2015/04/13.16:26:07 156.0 GB of instance data copied, dev_iosecs 1887, dev_iorate 88.8 MB/s

I want to parse "156.0 GB" into two matching groups: the number and the units. This field can also contain TB, MB, KB and possibly even plain bytes, but for now I just want to focus on TB, GB, MB and KB. I'll deal with the bytes case later if it arises.

    if self.type == "cpinstance":
        if re.search("of instance data copied", line):
            m = re.match("(?P<datasize>\d[.][\d]) (?P<units>TB|GB|MB|KB) of instance data copied", line)
            print m.group('datasize'), m.group('units')
            if m.group('units') == "GB":
                print "MATCH!!!!!"

I've tried scores of permutations of regexps and can't for the life of me get m.group to work. Every attempt ends like this:

Traceback (most recent call last):
  File "./listInstances.py", line 187, in <module>
    tscript = OSBTranscript(image.jobid)
  File "/devel/REPO/PYLIB/osb.py", line 833, in __init__
    print m.group('datasize'), m.group('units')
AttributeError: 'NoneType' object has no attribute 'group'

I'm sure it's something stupid staring me right in the face but currently eluding me. =p

Thanks for any help.

Upvotes: 0

Views: 36

Answers (2)

hwnd

Reputation: 70732

re.match() only matches at the beginning of the string. You need re.search(), which scans through the string for the first location where the regular expression pattern produces a match:

>>> import re
>>> s = '2015/04/13.16:26:07 156.0 GB of instance data copied, dev_iosecs 1887, dev_iorate 88.8 MB/s'
>>> m = re.search(r'(?P<datasize>\d+(?:\.\d+)?) (?P<units>[TGMK]B)', s)
>>> print m.group('datasize'), m.group('units')
156.0 GB

Note: The pattern inside your datasize named group, \d[.][\d], matches exactly one digit, a dot, and one more digit, so it cannot capture 156.0 in full. It needs quantifiers, so I changed it to \d+(?:\.\d+)?, which matches one or more digits optionally followed by a decimal part.
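
To see why the original code raised AttributeError, here is a minimal sketch that runs the same pattern through both functions. The variable names anchored and found are just for illustration, and I've kept the trailing "of instance data copied" text from your pattern. re.match returns None because the line starts with the timestamp, and calling .group on None is what produces the traceback, so it is worth checking the result before using it.

```python
import re

line = ("2015/04/13.16:26:07 156.0 GB of instance data copied, "
        "dev_iosecs 1887, dev_iorate 88.8 MB/s")
pattern = r"(?P<datasize>\d+(?:\.\d+)?) (?P<units>[TGMK]B) of instance data copied"

# re.match only tries the pattern at position 0, where the timestamp sits.
anchored = re.match(pattern, line)

# re.search tries each position in turn until the pattern fits.
found = re.search(pattern, line)

print(anchored)  # None
if found is not None:
    # Single-argument print works the same on Python 2 and Python 3.
    print("%s %s" % (found.group("datasize"), found.group("units")))  # 156.0 GB
```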

Upvotes: 1

Kevin

Reputation: 76194

match only matches at the beginning of the string, so it fails as soon as it sees the date and time section. Try using search instead of match.

import re

line = "2015/04/13.16:26:07 156.0 GB of instance data copied, dev_iosecs 1887, dev_iorate 88.8 MB/s"

if re.search("of instance data copied", line):
    m = re.search("(?P<datasize>\d[.][\d]) (?P<units>TB|GB|MB|KB) of instance data copied", line)
    print m.group('datasize'), m.group('units')
    if m.group('units') == "GB":
        print "MATCH!!!!!"

Result:

6.0 GB
MATCH!!!!!

Good start, but it only matches one digit before the decimal point. Try putting a star after your \d, which also accepts numbers like ".5", or a plus if you want to require at least one digit before the point.

import re

line = "2015/04/13.16:26:07 156.0 GB of instance data copied, dev_iosecs 1887, dev_iorate 88.8 MB/s"

if re.search("of instance data copied", line):
    m = re.search("(?P<datasize>\d*[.][\d]) (?P<units>TB|GB|MB|KB) of instance data copied", line)
    print m.group('datasize'), m.group('units')
    if m.group('units') == "GB":
        print "MATCH!!!!!"

Result:

156.0 GB
MATCH!!!!!
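
One caveat: \d*[.][\d] still captures only a single digit after the decimal point and requires the point to be present, so a value like "12.75 TB" or "3 GB" would not match in full. Below is a rough sketch of one way to loosen that and to avoid the AttributeError when a line doesn't match. The names SIZE_RE and parse_copied_size are made up for this example and are not part of your script; adapt as needed.

```python
import re

# Hypothetical names for illustration only.
# \d+(?:\.\d+)? = one or more digits, optionally followed by a decimal part.
SIZE_RE = re.compile(
    r"(?P<datasize>\d+(?:\.\d+)?) (?P<units>TB|GB|MB|KB) of instance data copied"
)

def parse_copied_size(line):
    """Return (size, units) for a matching line, or None when nothing matches."""
    m = SIZE_RE.search(line)
    if m is None:
        # No match: return None instead of calling .group on None.
        return None
    return float(m.group("datasize")), m.group("units")

line = ("2015/04/13.16:26:07 156.0 GB of instance data copied, "
        "dev_iosecs 1887, dev_iorate 88.8 MB/s")

print(parse_copied_size(line))                    # (156.0, 'GB')
print(parse_copied_size("no size on this line"))  # None
```

Returning None for non-matching lines lets the caller skip them with a plain "is None" check rather than wrapping every line in a separate re.search test.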

Upvotes: 2
