Wisco crew

Reputation: 1367

Python re.split() misbehaving

I'm trying to parse a log generated by git log --numstat. It is in the format

commit 1234567890123456789012345678901234567890
Author: Joseph Shabadoo
Date: Sun Apr 21 14:34:36 2013 +0300

    fix the thing that was broken

4   0   foo.py
13  7   bar.py

commit aaaaaaaaaabbbbbbbbbbccccccccccdddddddddd
Author: Donald Dont
Date: Fri Apr 19 21:15:00 2012 +0300

    do some stuff

15  6   foo.py

... etc

I have it stored in a file, and I want to split it into commits for easier parsing. I am using re.split(), but can't seem to find the right regular expression for the job. I would think using

re.split('.*?\n\n.*?\n\n.*?\n\n', myfile.read())

would work, but I got all of the first commit and the first two lines of the second commit lumped together (commit aaaaa... and Author: ...). This is especially confusing, because there are no two successive newlines after the Author: line. What regular expression can split this up?

EDIT: Apparently . doesn't match the newline character by default. For . to match newlines as well, the regex needs the flag re.DOTALL.
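A minimal sketch of the behaviour described in the edit. The sample string and variable names are made up for illustration and are not part of the original log:

```python
import re

# Illustrative text with a blank line in the middle (not from the git log).
text = "first\n\nsecond"

# By default "." stops at a newline, so the pattern cannot span the blank line.
print(re.search(r"first.*second", text))  # None

# With re.DOTALL, "." also matches "\n", so the same pattern now matches.
pattern = re.compile(r"first.*second", re.DOTALL)
print(pattern.search(text).group(0) == text)  # True
```

The flag can be given to re.compile() as here, or passed straight to the module-level functions such as re.search().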

Upvotes: 0

Views: 261

Answers (3)

Tim Pietzcker

Reputation: 336408

Let's visualize it:

[Image: RegexBuddy screenshot]

Your regex requires two newlines at the end of the match, and there is only one after the line

4   0   foo.py
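The effect can be reproduced with a short sketch. It assumes the sample log from the question is held in a string; the names log, pattern, m and parts are illustrative:

```python
import re

# The sample log from the question, written out as one string.
log = (
    "commit 1234567890123456789012345678901234567890\n"
    "Author: Joseph Shabadoo\n"
    "Date: Sun Apr 21 14:34:36 2013 +0300\n"
    "\n"
    "    fix the thing that was broken\n"
    "\n"
    "4   0   foo.py\n"
    "13  7   bar.py\n"
    "\n"
    "commit aaaaaaaaaabbbbbbbbbbccccccccccdddddddddd\n"
    "Author: Donald Dont\n"
    "Date: Fri Apr 19 21:15:00 2012 +0300\n"
    "\n"
    "    do some stuff\n"
    "\n"
    "15  6   foo.py\n"
    "\n"
)

pattern = r".*?\n\n.*?\n\n.*?\n\n"

# Without re.DOTALL each ".*?" has to stay on one line, so the pattern needs
# three newline-free stretches that are each followed by two newlines.
# The first commit fails at "4   0   foo.py" (one newline only), so the
# first place the pattern fits is the second commit's "Date:" line.
m = re.search(pattern, log)
print(m.group(0).splitlines()[0])

# re.split() cuts at that match, so everything before it ends up in one piece.
parts = re.split(pattern, log)
print(parts[0].splitlines()[-1])
```

This is why the first piece contains the whole first commit plus the commit and Author: lines of the second one.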

Upvotes: 3

kender

Reputation: 87211

You could loop over the lines and check each one for a commit header, collecting the lines that belong to the current commit in a list.

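import re

allCommits = []
currentCommitLines = []
for line in lines:  # lines: the log's lines, e.g. from myfile.readlines()
    # A commit header starts a new commit, so store the one collected so far.
    if re.match(r'^commit [0-9a-f]{40}', line) and currentCommitLines:
        allCommits.append(currentCommitLines)
        currentCommitLines = []
    currentCommitLines.append(line)
# The last commit is not followed by another header, so store it here.
if currentCommitLines:
    allCommits.append(currentCommitLines)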

Then you would have the commits stored in the list and could parse them however you like later.

Upvotes: 1

John Zwinck

Reputation: 249404

How about just matching the first line, which is quite consistent?

'commit [0-9a-f]{40}'
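One way this could be used with re.split() is sketched below. The lookahead and the variable names log and commits are assumptions added for illustration, not part of the answer above:

```python
import re

# The sample log from the question, written out as one string.
log = (
    "commit 1234567890123456789012345678901234567890\n"
    "Author: Joseph Shabadoo\n"
    "Date: Sun Apr 21 14:34:36 2013 +0300\n"
    "\n"
    "    fix the thing that was broken\n"
    "\n"
    "4   0   foo.py\n"
    "13  7   bar.py\n"
    "\n"
    "commit aaaaaaaaaabbbbbbbbbbccccccccccdddddddddd\n"
    "Author: Donald Dont\n"
    "Date: Fri Apr 19 21:15:00 2012 +0300\n"
    "\n"
    "    do some stuff\n"
    "\n"
    "15  6   foo.py\n"
    "\n"
)

# Split at each newline that is directly followed by a commit header line.
# The lookahead (?=...) checks for the header without consuming it, so the
# header stays at the start of the following chunk.
commits = re.split(r"\n(?=commit [0-9a-f]{40}\n)", log)

# Trim surrounding blank lines and drop any empty chunks.
commits = [c.strip() for c in commits if c.strip()]

for c in commits:
    print(c.splitlines()[0])
```

Because git log indents the commit message by four spaces, a message line that happens to contain the word commit followed by a hash does not start at the beginning of a line and is not mistaken for a header.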

Upvotes: 4
