Reputation: 1367
I'm trying to parse a log generated by git log --numstat. It is in the format
commit 1234567890123456789012345678901234567890
Author: Joseph Shabadoo
Date: Sun Apr 21 14:34:36 2013 +0300
fix the thing that was broken
4 0 foo.py
13 7 bar.py
commit aaaaaaaaaabbbbbbbbbbccccccccccdddddddddd
Author: Donald Dont
Date: Fri Apr 19 21:15:00 2012 +0300
do some stuff
15 6 foo.py
... etc
I have it stored in a file, and I want to split it into commits for easier parsing. I am using re.split(), but can't seem to find the right regular expression for the job. I would think using
re.split('.*?\n\n.*?\n\n.*?\n\n', myfile.read())
would work, but I got all of the first commit and the first two lines of the second commit (commit aaaaa... and Author: ...) lumped together as well. This is especially confusing, because there are not two successive newlines after the Author: line. What regular expression can split this up?
EDIT: apparently . doesn't match the newline character by default. The re needs to be compiled with the flag re.DOTALL.
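To illustrate the point in the edit, here is a minimal sketch of how . behaves with and without re.DOTALL. The sample string and the variable names are made up for the demonstration and are not taken from the actual log.

```python
import re

# Hypothetical two-line sample, just to show where "." stops.
text = "first line\nsecond line"

# Without flags, "." does not match "\n", so the match ends with the first line.
default_match = re.match(r".*", text).group(0)

# With re.DOTALL, "." also matches "\n", so the match runs to the end of the string.
dotall_match = re.match(r".*", text, re.DOTALL).group(0)

print(repr(default_match))
print(repr(dotall_match))
```

Note that with re.DOTALL a lazy .*? can also cross line boundaries, so the original pattern may then consume more text than intended.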
Upvotes: 0
Views: 261
Reputation: 336408
Let's visualize it:
Your regex requires two newlines at the end of the match, and there is only one after the line
4 0 foo.py
Upvotes: 3
Reputation: 87211
You could loop over the lines, matching each line that starts with commit. Then you can store all the lines of the current commit in a list.
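import re

allCommits = []
currentCommitLines = []
for line in lines:
    # A new "commit <sha>" header closes the commit collected so far.
    if re.match(r'commit [0-9a-f]{40}', line) and currentCommitLines:
        allCommits.append(currentCommitLines)
        currentCommitLines = []
    currentCommitLines.append(line)
# Flush the last commit, which has no following header to trigger the append.
if currentCommitLines:
    allCommits.append(currentCommitLines)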
Then you would have the commits stored in the array and you could parse/do whatever you'd like with them later.
Upvotes: 1
Reputation: 249404
How about just matching the first line, which is quite consistent?
'commit [0-9a-f]{40}'
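One way this could be turned into a splitter is sketched below. It is only an illustration: split_commits, COMMIT_HEADER and log_text are hypothetical names, and the sample text copies the layout shown in the question (real git log --numstat output may contain extra blank lines and indentation, which should not matter because only the header line is matched).

```python
import re

# Sample text in the layout shown in the question (an assumption about the real file).
log_text = """commit 1234567890123456789012345678901234567890
Author: Joseph Shabadoo
Date: Sun Apr 21 14:34:36 2013 +0300
fix the thing that was broken
4 0 foo.py
13 7 bar.py
commit aaaaaaaaaabbbbbbbbbbccccccccccdddddddddd
Author: Donald Dont
Date: Fri Apr 19 21:15:00 2012 +0300
do some stuff
15 6 foo.py
"""

# re.MULTILINE lets ^ and $ anchor at every line, not just at the ends of the string.
COMMIT_HEADER = re.compile(r"^commit [0-9a-f]{40}$", re.MULTILINE)

def split_commits(text):
    """Return one string per commit, each starting at its 'commit <sha>' line."""
    starts = [m.start() for m in COMMIT_HEADER.finditer(text)]
    # Each commit ends where the next one starts; the last runs to the end of the text.
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

commits = split_commits(log_text)
print(len(commits))
```

Slicing between header positions keeps the commit line inside each chunk, which a plain re.split on the header pattern would discard.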
Upvotes: 4