Reputation: 13816
Having the following string:
commit a8c11fcee68881dfb86095aa36290fb304047cf1
log size 110
Author: XXXXXX XXXXXXXX <[email protected]>
Date: Tue, 10 Apr 2012 11:19:44 +0300
First commit
3 0 README.MD
How can I use the value 110 in the definition of the grammar to match the rest of the input? The "log size" includes the fields (here: Author and Date, but there could be any number of fields) and the actual message.
The last line is not part of the "log message".
What I want to get are the value of commit, the dictionary with metadata like Author and Date, and the actual log message, here "First commit".
The thing is, log size tells me how long this message is, but this includes the fields Author and Date as well. Here, 110 is the size of this string:
Author: XXXXXX XXXXXXXX <[email protected]>
Date: Tue, 10 Apr 2012 11:19:44 +0300
First commit
Upvotes: 2
Views: 114
Reputation: 63762
You tagged your question with the "pyparsing" tag, so here is how you might use pyparsing to address it. Pyparsing includes a helper method called countedArray that does pretty much what you want. You can think of:
countedArray(something)
as a short-cut for:
integer + something*(whatever was parsed as the integer)
It returns the parsed data as a list of somethings.
source = """commit a8c11fcee68881dfb86095aa36290fb304047cf1
log size 110
Author: XXXXXX XXXXXXXX <[email protected]>
Date: Tue, 10 Apr 2012 11:19:44 +0300
First commit<not in the message, more than 110 chars>
3 0 README.MD
"""
from pyparsing import *
any_char = Word(printables+" \n",exact=1).leaveWhitespace()
log_message = countedArray(any_char)
# we want a string, not an array of single characters
log_message.setParseAction(lambda t: ''.join(t[0]))
entry = "commit" + Word(hexnums)('id') + "log size" + log_message
msg = entry.parseString(source)[-1]
print (msg)
Gives:
Author: XXXXXX XXXXXXXX <[email protected]>
Date: Tue, 10 Apr 2012 11:19:44 +0300
First commit
and you can see that we have read up to, but not including, the "not in the message..." part (which I added to your source string to show that countedArray correctly stops at the 110th character). I added a parse action, which pyparsing runs as a parse-time callback when the expression is matched. The matched tokens are passed to the parse action, and if the action returns a value, that value replaces the parsed tokens in the output. Here we use a simple lambda to take the first token (the array of characters) and join its elements into a single string.
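To picture the counted read without pyparsing, here is a minimal plain-Python sketch of the same idea: read an integer, then take exactly that many single-character items. The function name read_counted is made up for this illustration, and it assumes the count sits alone on the first line; it is not how pyparsing implements countedArray.

```python
def read_counted(text):
    """Return the counted characters that follow a leading integer line."""
    count_line, newline, rest = text.partition('\n')
    if not newline or not count_line.strip().isdigit():
        raise ValueError('expected an integer count on the first line')
    count = int(count_line)
    if len(rest) < count:
        raise ValueError('input is shorter than the announced count')
    # A list of single-character items, comparable to what any_char produces
    return list(rest[:count])


chars = read_counted('5\nhello world')
joined = ''.join(chars)  # the same join that the lambda parse action performs
```

Everything after the fifth character (" world") is left unread, which is the behaviour the "not in the message" marker demonstrates above.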
But you also said that you wanted to extract the fields 'Author' and 'Date'. They are actually part of the stuff that was extracted in the log_message, so you'll have to pass that string to another expression. Fortunately, you can do that kind of thing in a second parse action.
In this second-phase parsing, I've decided to create a parser that will take any keyed values of the form:
some key: the value of that key up to the end of the line
in case 'Author' and 'Date' are just examples of any number of keys that you might find in your source text. Pyparsing also has named results, similar to named groups in regular expressions. Normally when the names are known, you can just tack on the name as shown above with your commit id. But in your log message, the actual names are parsed out of the input itself, so for this, pyparsing has a Dict class. The Dict class will take a parsed set of groups, and decorate that data with names for each group, taking the first element of the group as the name, and the remainder of the group as the value. So we want names for each keyed value shown, where everything up to the ':' is going to be the name; we then skip over the colon and any leading spaces, and take the rest of the line as the value.
COLON = Suppress(':')
keyed_value = Group(Word(printables+' ',excludeChars=':') + COLON + empty + restOfLine)
keyed_entries = Dict(ZeroOrMore(keyed_value))
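For comparison, the same "key: value up to the end of the line" split can be sketched in plain Python. The helper name keyed_lines and the sample text are assumptions made for this illustration; like ZeroOrMore, the loop stops at the first line that does not look like a keyed value.

```python
def keyed_lines(block):
    """Collect leading 'key: value' lines of block into a dict."""
    entries = {}
    for line in block.splitlines():
        key, colon, value = line.partition(':')
        if not colon:
            break  # first line without a colon ends the keyed entries
        entries[key.strip()] = value.strip()
    return entries


sample = ('Author: Jane Doe <jane@example.com>\n'
          'Date: Tue, 10 Apr 2012 11:19:44 +0300\n'
          '\n'
          '    First commit\n')
fields = keyed_lines(sample)
```

Splitting on the first colon only keeps values such as the time of day intact, which is the same reason the grammar above excludes ':' from the key but not from the rest of the line.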
We still want that log message too; the easiest way is to use pyparsing's SkipTo to just take everything else up to the end of the string:
everything_else_up_to_the_end_of_the_string = SkipTo(StringEnd())
Here is the grammar for your log message body:
log_message_body = (keyed_entries +
                    everything_else_up_to_the_end_of_the_string('message'))
We already have a parse action to combine all the parsed characters in the log_message into a single string, but it is possible to chain multiple parse actions onto a single expression. We'll add a second parse action, which will parse the parsed log_message using the log_message_body grammar:
def parseMessage(tokens):
    return log_message_body.parseString(tokens[0])
log_message.addParseAction(parseMessage)
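The chaining idea can be modelled in a few lines of plain Python. This is only a simplified picture, not pyparsing's implementation: run_parse_actions, join_chars and split_body are hypothetical names, and the second step splits on the blank line instead of running a real grammar.

```python
def run_parse_actions(tokens, actions):
    """Apply each action in order; a non-None return value replaces the tokens."""
    for action in actions:
        result = action(tokens)
        if result is not None:
            tokens = result
    return tokens


def join_chars(tokens):
    # First action: collapse the list of single characters into one string
    return [''.join(tokens[0])]


def split_body(tokens):
    # Second action: re-parse the joined string (here: split on the blank line)
    head, _, message = tokens[0].partition('\n\n')
    return [head, message.strip()]


tokens = [list('Date: today\n\n  First commit')]
result = run_parse_actions(tokens, [join_chars, split_body])
```

Each action sees the output of the previous one, so the second action receives a single string rather than the original list of characters.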
Now we can run our full parser again, and this time, dump out the results and their names. The named results can be accessed just as if they were attributes of an object (or you can use dict notation if you prefer):
log_entry = entry.parseString(source)
print (log_entry.id)
print (log_entry['message'])
print (log_entry.dump())
Gives:
a8c11fcee68881dfb86095aa36290fb304047cf1
First commit
['commit', 'a8c11fcee68881dfb86095aa36290fb304047cf1', 'log size',
['Author', 'XXXXXX XXXXXXXX <[email protected]>'],
['Date', 'Tue, 10 Apr 2012 11:19:44 +0300'],
'First commit']
- Author: XXXXXX XXXXXXXX <[email protected]>
- Date: Tue, 10 Apr 2012 11:19:44 +0300
- id: a8c11fcee68881dfb86095aa36290fb304047cf1
- message: First commit
Upvotes: 0
Reputation: 27585
I had the same idea for the algorithm as NPE.
But I pushed the use of regexes a little further.
I've extended the analyzed text with a second occurrence of a log message, taking care to put the right number of characters in the 'log size xxx\n' line.
regx1 cuts each occurrence into 4 groups. The third group contains the lines that make up the dictionary, and the fourth group has the trailing lines after the dictionary lines and before the next occurrence.
import re
ss = """commit a8c11fcee68881dfb86095aa36290fb304047cf1
log size 110
Author: XXXXXX XXXXXXXX <[email protected]>
Date: Tue, 10 Apr 2012 11:19:44 +0300
First commit
3 0 README.MD
blablah bla
commit 12458777AFDRE1254
log size 170
Author: Jim Bluefish <[email protected]>
Date : Yesterday 21:45:01 +0800
A key with whitespace : A_stupid_value
Funny commit
From far from you
457 popo not_README.MD"""
n = 0
print('------ DISPLAY OF THE TEXT ------\n'
      ' col 1: index of line,\n'
      ' col 2: number of chars in the line\n'
      ' col 3: total of the numbers of chars of lines\n'
      ' col 4: repr(line)\n')
for j, line in enumerate(ss.splitlines(True)):
    n += len(line)
    print('%2d %2d %3d %r' % (j, len(line), n, line))
print('=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=')
print('\n\n\n------ ANALYSER OF THE TEXT ------')
# group 1: commit id, group 2: log size,
# group 3: the "key : value" lines, group 4: the lines up to the next commit
regx1 = re.compile(r'^commit +(.+) *\r?\n'
                   r'log size +(\d+) *\r?\n'
                   r'((?:^ *.+?(?<! ) *: *.+(?<! ) *\r?\n)+)'
                   r'((?:.*\r?\n(?!commit))+)',
                   re.MULTILINE)
# extracts one (key, value) pair per "key : value" line
regx2 = re.compile(r'^ *(.+?)(?<! ) *: *(.+)(?<! ) *\r?\n',
                   re.MULTILINE)
for mat in regx1.finditer(ss):
    commit_value, logsize, dicolines, msg = mat.groups()
    print('\ncommit_value == %s\n'
          'logsize == %s'
          % (commit_value, logsize))
    print('dico :\n%s' % dict(regx2.findall(dicolines)))
    # the log size covers the dictionary lines too, so subtract their length
    actual_log_message = msg[0:int(logsize)-len(dicolines)].strip(' \r\n')
    print('actual_log_message == %r' % actual_log_message)
result
------ DISPLAY OF THE TEXT ------
col 1: index of line,
col 2: number of chars in the line
col 3: total of the numbers of chars of lines
col 4: repr(line)
0 48 48 'commit a8c11fcee68881dfb86095aa36290fb304047cf1\n'
1 13 61 'log size 110\n'
2 52 113 'Author: XXXXXX XXXXXXXX <[email protected]>\n'
3 40 153 'Date: Tue, 10 Apr 2012 11:19:44 +0300\n'
4 1 154 '\n'
5 17 171 ' First commit\n'
6 26 197 '3 0 README.MD\n'
7 12 209 'blablah bla\n'
8 25 234 'commit 12458777AFDRE1254\n'
9 13 247 'log size 170\n'
10 45 292 ' Author: Jim Bluefish <[email protected]>\n'
11 36 328 'Date : Yesterday 21:45:01 +0800\n'
12 51 379 ' A key with whitespace : A_stupid_value \n'
13 1 380 '\n'
14 17 397 ' Funny commit\n'
15 20 417 ' From far from you\n'
16 33 450 '457 popo not_README.MD'
------ ANALYSER OF THE TEXT ------
commit_value == a8c11fcee68881dfb86095aa36290fb304047cf1
logsize == 110
dico :
{'Date': 'Tue, 10 Apr 2012 11:19:44 +0300', 'Author': 'XXXXXX XXXXXXXX <[email protected]>'}
actual_log_message == 'First commit'
commit_value == 12458777AFDRE1254
logsize == 170
dico :
{'Date': 'Yesterday 21:45:01 +0800', 'A key with whitespace': 'A_stupid_value', 'Author': 'Jim Bluefish <[email protected]>'}
actual_log_message == 'Funny commit\n From far from you'
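The `(?<! )` lookbehinds in regx2 are what keep trailing spaces out of the captured keys and values. Here is a reduced sketch of that one trick; the pattern is a variant of regx2 that uses `$` instead of the explicit line ending, and the sample lines are made up for the illustration.

```python
import re

# (?<! ) asserts that the character just before this point is not a space,
# so neither captured group can end with trailing spaces.
pair = re.compile(r'^ *(.+?)(?<! ) *: *(.+)(?<! ) *$', re.MULTILINE)

lines = (' A key with whitespace : A_stupid_value  \n'
         'Date : Yesterday 21:45:01 +0800\n')
pairs = pair.findall(lines)
```

The key is matched lazily, so it ends at the first colon, while the value is matched greedily and may therefore contain further colons such as those in a time of day.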
Upvotes: 2
Reputation: 500773
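I would do it in three stages: find the commit and log size header with a regex, slice out the next "log size" characters, and parse that slice into the metadata and the message.
The first two steps can be done as follows (after import re):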
In [25]: s = """commit a8c11fcee68881dfb86095aa36290fb304047cf1
log size 110
Author: XXXXXX XXXXXXXX <[email protected]>
Date: Tue, 10 Apr 2012 11:19:44 +0300
First commit
3 0 README.MD
"""
In [26]: m = re.search('commit (.*)\nlog size (.*)\n', s)
In [27]: s[m.end():m.end()+int(m.group(2))]
Out[27]: 'Author: XXXXXX XXXXXXXX <[email protected]>\nDate: Tue, 10 Apr 2012 11:19:44 +0300\n\n First commit\n'
If the last string is called step2, you can do the rest of the parsing as follows:
In [48]: meta, msg = step2.split('\n\n', 1)
In [49]: dict([map(str.strip, line.split(':', 1)) for line in meta.split('\n')])
Out[49]:
{'Author': 'XXXXXX XXXXXXXX <[email protected]>',
'Date': 'Tue, 10 Apr 2012 11:19:44 +0300'}
In [50]: msg
Out[50]: ' First commit\n'
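Put together, the stages might look like the sketch below, extended to walk over several entries by jumping past each counted body. The names iter_entries and make_entry and the sample data are assumptions made for this illustration; the sample sizes are computed with len() so they always match the body.

```python
import re

HEADER = re.compile(r'^commit (\S+)\nlog size (\d+)\n', re.MULTILINE)


def iter_entries(text):
    """Yield (commit_id, metadata, message) for every entry in text."""
    pos = 0
    while True:
        header = HEADER.search(text, pos)
        if header is None:
            return
        start = header.end()
        end = start + int(header.group(2))  # the log size bounds the body
        meta_block, _, message = text[start:end].partition('\n\n')
        metadata = {}
        for line in meta_block.splitlines():
            key, _, value = line.partition(':')
            metadata[key.strip()] = value.strip()
        yield header.group(1), metadata, message.strip()
        pos = end  # resume the search after the counted body


def make_entry(commit_id, body, trailer):
    """Build a hypothetical sample entry whose log size matches body."""
    return 'commit %s\nlog size %d\n%s%s' % (commit_id, len(body), body, trailer)


sample = (
    make_entry('abc123',
               'Author: Jane Doe <jane@example.com>\n'
               'Date: Tue, 10 Apr 2012 11:19:44 +0300\n\n    First commit\n',
               '\n3\t0\tREADME.MD\n\n')
    + make_entry('def456',
                 'Author: John Roe <john@example.com>\n\n    Second commit\n',
                 '\n1\t1\tsetup.py\n')
)
entries = list(iter_entries(sample))
```

One caveat: the slice counts characters. If the tool that writes the log size counts bytes and the text contains non-ASCII characters, it would be safer to slice the encoded bytes and decode afterwards.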
Upvotes: 2