Reputation: 720
I'm pulling my hair out with Python regexps.
I have a string which contains the multi-line output from an OS command.
One of its lines will contain a string like the following:
2015/04/13.16:26:07 156.0 GB of instance data copied, dev_iosecs 1887, dev_iorate 88.8 MB/s
I want to parse "156.0 GB" into two matching groups, one for the number and one for the units. This field can also contain TB, MB, KB, and possibly even just bytes, but for now I just want to focus on TB, GB, MB, and KB; I'll deal with the plain-bytes case later if it arises.
if self.type == "cpinstance":
    if re.search("of instance data copied", line):
        m = re.match("(?P<datasize>\d[.][\d]) (?P<units>TB|GB|MB|KB) of instance data copied", line)
        print m.group('datasize'), m.group('units')
        if m.group('units') == "GB":
            print "MATCH!!!!!"
I've tried scores of permutations of regexps and can't for the life of me get m.group to ever work.
Traceback (most recent call last):
  File "./listInstances.py", line 187, in <module>
    tscript = OSBTranscript(image.jobid)
  File "/devel/REPO/PYLIB/osb.py", line 833, in __init__
    print m.group('datasize'), m.group('units')
AttributeError: 'NoneType' object has no attribute 'group'
I'm sure it's something stupid staring me right in the face but currently eluding me. =p
Thanks for any help.
Upvotes: 0
Views: 36
Reputation: 70732
re.match() only matches at the beginning of the string, so it returns None here because the line starts with a timestamp. You need to use re.search(), which looks for the first location where the regular expression pattern produces a match:
>>> import re
>>> s = '2015/04/13.16:26:07 156.0 GB of instance data copied, dev_iosecs 1887, dev_iorate 88.8 MB/s'
>>> m = re.search(r'(?P<datasize>\d+(?:\.\d+)?) (?P<units>[TGMK]B)', s)
>>> print m.group('datasize'), m.group('units')
156.0 GB
Note: The regular expression inside your <datasize> named group was not matching as expected. \d[.][\d] matches exactly one digit on each side of the decimal point, so it needs quantifiers to capture the whole number. I modified the pattern to allow for that as well.
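A minimal sketch of the difference between the two calls, written with Python 3 print syntax (the question uses Python 2). The variable names are illustrative, and the pattern is the one above with the asker's trailing literal text appended. It also checks for None before calling .group(), which is what the traceback in the question trips over:

```python
import re

s = '2015/04/13.16:26:07 156.0 GB of instance data copied, dev_iosecs 1887, dev_iorate 88.8 MB/s'
pattern = r'(?P<datasize>\d+(?:\.\d+)?) (?P<units>[TGMK]B) of instance data copied'

# re.match only tries position 0, where the timestamp sits, so it returns None.
anchored = re.match(pattern, s)

# re.search scans forward and finds the size field in the middle of the line.
found = re.search(pattern, s)

# Both calls return None on failure, so test the result before using .group().
if found is not None:
    print(found.group('datasize'), found.group('units'))
```

Testing the result first turns a confusing AttributeError into an explicit "this line did not match" branch.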
Upvotes: 1
Reputation: 76194
match always starts at the beginning of the string, so it fails as soon as it sees the date and time section. Try using search instead of match.
import re
line = "2015/04/13.16:26:07 156.0 GB of instance data copied, dev_iosecs 1887, dev_iorate 88.8 MB/s"
if re.search("of instance data copied", line):
    # search, unlike match, is not anchored to the start of the string
    m = re.search(r"(?P<datasize>\d[.][\d]) (?P<units>TB|GB|MB|KB) of instance data copied", line)
    print m.group('datasize'), m.group('units')
    if m.group('units') == "GB":
        print "MATCH!!!!!"
Result:
6.0 GB
MATCH!!!!!
Good start, but it only matches one digit before the decimal point. Try putting a star after your \d (or perhaps a plus, depending on whether you want to find numbers like ".5": the star accepts them, the plus requires at least one leading digit).
import re
line = "2015/04/13.16:26:07 156.0 GB of instance data copied, dev_iosecs 1887, dev_iorate 88.8 MB/s"
if re.search("of instance data copied", line):
    # \d* allows any number of digits before the decimal point
    m = re.search(r"(?P<datasize>\d*[.][\d]) (?P<units>TB|GB|MB|KB) of instance data copied", line)
    print m.group('datasize'), m.group('units')
    if m.group('units') == "GB":
        print "MATCH!!!!!"
Result:
156.0 GB
MATCH!!!!!
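If it helps, the same idea can be wrapped in a small helper. This is only a sketch: parse_copied_size and SIZE_RE are hypothetical names that do not appear in the original code, the syntax is Python 3, and the pattern additionally assumes the size may have several digits after the decimal point or none at all.

```python
import re

# Hypothetical helper names; not part of the asker's osb.py.
SIZE_RE = re.compile(
    r'(?P<datasize>\d+(?:\.\d+)?) (?P<units>TB|GB|MB|KB) of instance data copied'
)

def parse_copied_size(line):
    """Return (size, units) for a matching line, or None if the line does not match."""
    m = SIZE_RE.search(line)  # search, not match: the field is mid-line
    if m is None:
        return None
    return float(m.group('datasize')), m.group('units')

line = "2015/04/13.16:26:07 156.0 GB of instance data copied, dev_iosecs 1887, dev_iorate 88.8 MB/s"
parsed = parse_copied_size(line)
if parsed is not None:
    size, units = parsed
    print(size, units)
```

Returning None for lines that do not match leaves it to the caller to decide whether to skip the line or report an error, instead of failing inside .group().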
Upvotes: 2