Reputation: 1493
I have a list containing the history of a file. I need to split each element of the list into several columns and save the result to a CSV file.
The columns I need are commit_id, filename, committer, date, time, line_number, code.
Suppose this is my list:
my_list = [
'f5213095324 master/ActiveMasterManager.java (Michael Stack 2010-08-31 23:51:44 +0000 1) /**',
'f5213095324 master/ActiveMasterManager.java (Michael Stack 2010-08-31 23:51:44 +0000 2) *',
'f5213095324 master/ActiveMasterManager.java (Michael Stack 2010-08-31 23:51:44 +0000 3) * Licensed to the Apache Software Foundation (ASF) under one',
'f5213095324 master/ActiveMasterManager.java (Michael Stack 2010-08-31 23:51:44 +0000 4) * or more contributor license agreements.',
...
'b5cf8748198 master/ActiveMasterManager.java (Michael Stack 2012-09-27 05:40:09 +0000 160) if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {'
]
The desired CSV output:
commit_id | filename | committer | date | time | line_number | code
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
f5213095324 | master/ActiveMasterManager.java | Michael Stack | 2010-08-31 | 23:51:44 | 1 | /**
f5213095324 | master/ActiveMasterManager.java | Michael Stack | 2010-08-31 | 23:51:44 | 2 | *
f5213095324 | master/ActiveMasterManager.java | Michael Stack | 2010-08-31 | 23:51:44 | 3 | * Licensed to the Apache Software Foundation (ASF) under one
f5213095324 | master/ActiveMasterManager.java | Michael Stack | 2010-08-31 | 23:51:44 | 4 | * or more contributor license agreements.
........
b5cf8748198 | master/ActiveMasterManager.java | Michael Stack | 2012-09-27 | 05:40:09 | 160 | if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {
I tried using this code:
import csv
import re

pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.+)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).+(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)')
with open('somefile.csv', 'w+', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['commit_id', 'filename', 'committer', 'date', 'time', 'line_number', 'code'])
    for line in my_list:
        writer.writerow([field.strip() for field in pattern.match(line).groups()])
In general, the code works. But for line number 160, -1 is written in the line_number column and only { is written in the code column.
Is there something missing in the regex?
Upvotes: 3
Views: 3522
Reputation: 2887
The main problem with your pattern is the use of the greedy .+. If you replace it with the lazy .*?, you fix not only the line number but also the trailing whitespace captured after the committer name:
pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).*?(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)')
https://regex101.com/r/f7zjpA/2
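To see the difference on the problematic line, here is a minimal sketch that runs both variants side by side. The names head, tail, greedy and lazy are only for illustration; both variants share the same prefix so that the quantifier in front of line_number is the only thing that changes. Note that the greedy version captures the 1 from "!= -1)" (the minus sign is not part of \d+), because .+ first runs to the end of the line and then backtracks to the last "digits followed by )".

```python
import re

line = ('b5cf8748198 master/ActiveMasterManager.java '
        '(Michael Stack 2012-09-27 05:40:09 +0000 160) '
        'if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {')

# Shared parts of the pattern (illustrative split, same groups as above).
head = (r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+'
        r'(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d)')
tail = r'(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)'

greedy = re.compile(head + r'.+' + tail)   # quantifier from the question
lazy = re.compile(head + r'.*?' + tail)    # quantifier from this answer

g = greedy.match(line)
l = lazy.match(line)

# Greedy backtracks from the end of the line and stops at the LAST "digits)".
print(g.group('line_number'), '|', g.group('code'))
# Lazy expands from the left and stops at the FIRST "digits)".
print(l.group('line_number'), '|', l.group('code'))
```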
EDIT:
You didn't mention that you wanted to keep the indentation, and your code didn't suggest it either. The whitespace before the code is removed not only by the regex pattern. Two things are responsible:
In the regex pattern you used \s+ before the code group, which consumes all of the leading whitespace. If you want to keep the indentation, replace \s+ with \s, which matches only the first whitespace character instead of all of them:
pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).*?(?P<line_number>\b\d+\b)\)\s(?P<code>[^"]*)')
In the for loop you use field.strip(), which removes all whitespace at the beginning and the end of the string. Modifying the pattern and replacing:
writer.writerow([field.strip() for field in pattern.match(line).groups()])
with:
writer.writerow(pattern.match(line).groups())
will result in keeping indentations where they belong.
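Putting both changes together, a rough sketch could look like the following. The sample line is hypothetical (your list as posted shows no indentation), and io.StringIO only stands in for the CSV file so the snippet runs on its own. The "if match" guard is an addition of mine, so that a line that does not fit the pattern is skipped instead of raising AttributeError.

```python
import csv
import io
import re

# Pattern from this answer: a single \s before the code group.
pattern = re.compile(
    r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+'
    r'(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).*?'
    r'(?P<line_number>\b\d+\b)\)\s(?P<code>[^"]*)'
)

# Hypothetical sample: one separator space, then four spaces of indentation.
sample = ('b5cf8748198 master/ActiveMasterManager.java '
          '(Michael Stack 2012-09-27 05:40:09 +0000 160) '
          '    if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {')

buffer = io.StringIO()           # stands in for open('somefile.csv', ...)
writer = csv.writer(buffer)
match = pattern.match(sample)
if match:                        # skip lines that do not fit the pattern
    writer.writerow(match.groups())   # no strip(), so indentation survives

# Read the row back to check what actually ended up in the CSV.
row = next(csv.reader(io.StringIO(buffer.getvalue())))
print(repr(row[6]))
```

The csv module quotes the code field on its own when it contains a comma, so the leading spaces come back unchanged when the file is read again.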
Upvotes: 1
Reputation: 11550
Not exactly what you are looking for, but this may be useful:
import re
for row in my_list:
    print([x.strip() for x in re.split(r"(?![)])\s+(?![(])", row)])
Output:
['f5213095324', 'master/ActiveMasterManager.java', '(Michael', 'Stack', '2010-08-31', '23:51:44', '+0000', '1)', '/**']
['f5213095324', 'master/ActiveMasterManager.java', '(Michael', 'Stack', '2010-08-31', '23:51:44', '+0000', '2)', '*']
...
Upvotes: 0
Reputation: 43
I fixed the regex. This should work:
pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.+)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).+?(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)')
I added a question mark to make the match lazy: ".+" => ".+?"
https://regex101.com/r/GQGLvy/1
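One more thing to watch, as a side note: the code group [^"]* stops at the first double quote, so a source line that contains a string literal would be cut off. A small sketch of the effect, assuming you want the whole rest of the line; the sample line and the names prefix, original and relaxed are made up for illustration:

```python
import re

# Everything up to the code group, as in the pattern above.
prefix = (r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.+)\s+'
          r'(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).+?'
          r'(?P<line_number>\b\d+\b)\)\s+')

original = re.compile(prefix + r'(?P<code>[^"]*)')  # stops at the first "
relaxed = re.compile(prefix + r'(?P<code>.*)')      # takes the rest of the line

# Hypothetical blame line whose code contains a string literal.
sample = ('b5cf8748198 master/ActiveMasterManager.java '
          '(Michael Stack 2012-09-27 05:40:09 +0000 161) '
          'LOG.info("started");')

print(original.match(sample).group('code'))
print(relaxed.match(sample).group('code'))
```

csv.writer escapes embedded double quotes itself, so excluding them in the regex should not be needed for the CSV to stay valid.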
Upvotes: 1