Flavius

Reputation: 13816

Using string length defined in the matched string

Having the following string:

commit a8c11fcee68881dfb86095aa36290fb304047cf1
log size 110
Author: XXXXXX XXXXXXXX <[email protected]>
Date:   Tue, 10 Apr 2012 11:19:44 +0300

    First commit

3       0       README.MD

How can I use the value 110 in the grammar definition to match the rest of the input? The "log size" covers the fields (here Author and Date, but there could be any number of fields) and the actual message.

The last line is not part of the "log message".

What I want to get are the values of commit, the dictionary with metadata like Author and Date, and the actual log message, here "First commit".

The thing is, log size tells me how long this message is, but this includes the fields Author and Date as well.

110 being the size of this string:

Author: XXXXXX XXXXXXXX <[email protected]>
Date:   Tue, 10 Apr 2012 11:19:44 +0300

    First commit

Upvotes: 2

Views: 114

Answers (3)

PaulMcG

Reputation: 63762

You tagged your question with the "pyparsing" tag, so here is how you might use pyparsing to address it. Pyparsing includes a helper method called countedArray that does pretty much what you want. You can think of:

countedArray(something)

as a short-cut for:

integer + something*(whatever was parsed as the integer)

and will return the parsed data as the list of somethings.
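Conceptually this is a length-prefixed read: parse a count, then consume exactly that many items. A minimal plain-Python sketch of the idea, assuming the count sits on its own line and is followed by the payload (this is not how pyparsing implements countedArray, and read_counted is a hypothetical helper name):

```python
def read_counted(text):
    """Split '<count>\\n<payload><rest>' into (payload, rest)."""
    header, _, rest = text.partition('\n')
    count = int(header)  # the declared length
    # Take exactly `count` characters, leave everything else untouched.
    return rest[:count], rest[count:]


payload, remainder = read_counted("5\nhello world")
print(repr(payload), repr(remainder))
```

The point is only that the grammar for the payload depends on a value read earlier in the same input, which is what countedArray packages up.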

source = """commit a8c11fcee68881dfb86095aa36290fb304047cf1
log size 110
Author: XXXXXX XXXXXXXX <[email protected]>
Date:   Tue, 10 Apr 2012 11:19:44 +0300

    First commit<not in the message, more than 110 chars>

3       0       README.MD
"""

from pyparsing import *
any_char = Word(printables+" \n",exact=1).leaveWhitespace()

log_message = countedArray(any_char)
# we want a string, not an array of single characters
log_message.setParseAction(lambda t: ''.join(t[0]))

entry = "commit" + Word(hexnums)('id') + "log size" + log_message

msg = entry.parseString(source)[-1]
print (msg)

Gives:

Author: XXXXXX XXXXXXXX <[email protected]>
Date:   Tue, 10 Apr 2012 11:19:44 +0300

    First commit

and you can see that we have read up to, but not including, the "not in the message..." part (which I added to your source string to show that countedArray correctly stops at the 110th character). I added a parse action, which pyparsing will run as a parse-time callback when the expression is matched. The matched tokens are passed to the parse action, and if the action returns a value, that value replaces the parsed tokens in the output. Here we use a simple lambda to take the first token (the array of characters), and join them into a single string.

But you also said that you wanted to extract the fields 'Author' and 'Date'. They are actually part of the stuff that was extracted in the log_message, so you'll have to pass that string to another expression. Fortunately, you can do that kind of thing in a second parse action.

In this second-phase parsing, I've decided to create a parser that will take any keyed values of the form:

some key: the value of that key up to the end of the line

in case 'Author' and 'Date' are just examples of any number of keys that you might find in your source text. Pyparsing also has named results, similar to named groups in regular expressions. Normally when the names are known, you can just tack on the name as shown above with your commit id. But in your log message, the actual names are parsed out of the input itself, so for this, pyparsing has a Dict class. The Dict class will take a parsed set of groups, and decorate that data with names for each group, taking the first element of the group as the name, and the remainder of the group as the value. So we want names for each keyed value shown, where everything up to the ':' is going to be the name, skip over the colon and any leading spaces, and then take the rest of the line as the value.

COLON = Suppress(':')
keyed_value = Group(Word(printables+' ',excludeChars=':') + COLON + empty + restOfLine)
keyed_entries = Dict(ZeroOrMore(keyed_value))

We still want that log message too. The easiest way is to use pyparsing's SkipTo to take everything else up to the end of the string:

everything_else_up_to_the_end_of_the_string = SkipTo(StringEnd())

Here is the grammar for your log message body:

# parentheses let the expression continue on the next line
log_message_body = (keyed_entries +
                    everything_else_up_to_the_end_of_the_string('message'))

We already have a parse action to combine all the parsed characters in the log_message into a single string, but it is possible to chain multiple parse actions onto a single expression. We'll add a second parse action, which will parse the parsed log_message using the log_message_body grammar:

def parseMessage(tokens):
    return log_message_body.parseString(tokens[0])
log_message.addParseAction(parseMessage)

Now we can run our full parser again, and this time, dump out the results and their names. The named results can be accessed just as if they were attributes of an object (or you can use dict notation if you prefer):

log_entry = entry.parseString(source)
print (log_entry.id)
print (log_entry['message'])
print (log_entry.dump())

Gives:

a8c11fcee68881dfb86095aa36290fb304047cf1
First commit
['commit', 'a8c11fcee68881dfb86095aa36290fb304047cf1',
    'log size', ['Author', 'XXXXXX XXXXXXXX <[email protected]>'],
    ['Date', 'Tue, 10 Apr 2012 11:19:44 +0300'],
    'First commit']
- Author: XXXXXX XXXXXXXX <[email protected]>
- Date: Tue, 10 Apr 2012 11:19:44 +0300
- id: a8c11fcee68881dfb86095aa36290fb304047cf1
- message: First commit

Upvotes: 0

eyquem

Reputation: 27585

I had the same idea for the algorithm as NPE, but I pushed the use of regexes a little further.

I've extended the analyzed text with a second log entry, taking care to put the right number of characters in its 'log size xxx\n' line.

regx1 splits each entry into 4 groups. The first two are the commit id and the log size. The third group contains the "key: value" lines that make up the dictionary, and the fourth group holds the lines that come after the dictionary lines and before the next entry.

import re

ss = """commit a8c11fcee68881dfb86095aa36290fb304047cf1
log size 110
Author: XXXXXX XXXXXXXX <[email protected]>
Date:   Tue, 10 Apr 2012 11:19:44 +0300

    First commit
3       0       README.MD
blablah bla
commit 12458777AFDRE1254
log size 170
   Author: Jim Bluefish <[email protected]>
Date   :   Yesterday 21:45:01 +0800
  A key with whitespace :       A_stupid_value    

    Funny commit
  From far from you
457      popo       not_README.MD"""

n = 0
print ('------ DISPLAY OF THE TEXT ------\n'
       ' col 1: index of line,\n'
       ' col 2: number of chars in the line\n'
       ' col 3: total of the numbers of chars of lines\n'
       ' col 4: repr(line)\n')
for j,line in enumerate(ss.splitlines(True)):  # True keeps the line endings
    n += len(line)
    print('%2d  %2d  %3d  %r' % (j,len(line),n,line))


print('=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=')
print('\n\n\n------ ANALYSER OF THE TEXT ------')

# groups: 1 = commit id, 2 = log size, 3 = the "key: value" lines,
# 4 = the lines that follow, up to the next "commit"
regx1 = re.compile(r'^commit +(.+) *\r?\n'
                   r'log size +(\d+) *\r?\n'
                   r'((?:^ *.+?(?<! ) *: *.+(?<! ) *\r?\n)+)'
                   r'((?:.*\r?\n(?!commit))+)',
                   re.MULTILINE)

# captures one key and one value, without the surrounding spaces
regx2 = re.compile(r'^ *(.+?)(?<! ) *: *(.+)(?<! ) *\r?\n',
                   re.MULTILINE)

for mat in regx1.finditer(ss):

    commit_value,logsize,dicolines,msg = mat.groups()

    print('\ncommit_value == %s\n'
          'logsize == %s'
          % (commit_value,logsize))

    print('dictionary :\n%s' % (dict(regx2.findall(dicolines)),))

    # the log size covers dicolines too, so the rest belongs to the message
    actual_log_message = msg[0:int(logsize)-len(dicolines)].strip(' \r\n')
    print('actual_log_message == %r' % actual_log_message)

Result:

------ DISPLAY OF THE TEXT ------
 col 1: index of line,
 col 2: number of chars in the line
 col 3: total of the numbers of chars of lines
 col 4: repr(line)

 0  48   48  'commit a8c11fcee68881dfb86095aa36290fb304047cf1\n'
 1  13   61  'log size 110\n'
 2  52  113  'Author: XXXXXX XXXXXXXX <[email protected]>\n'
 3  40  153  'Date:   Tue, 10 Apr 2012 11:19:44 +0300\n'
 4   1  154  '\n'
 5  17  171  '    First commit\n'
 6  26  197  '3       0       README.MD\n'
 7  12  209  'blablah bla\n'
 8  25  234  'commit 12458777AFDRE1254\n'
 9  13  247  'log size 170\n'
10  45  292  '   Author: Jim Bluefish <[email protected]>\n'
11  36  328  'Date   :   Yesterday 21:45:01 +0800\n'
12  51  379  '  A key with whitespace :       A_stupid_value    \n'
13   1  380  '\n'
14  17  397  '    Funny commit\n'
15  20  417  '  From far from you\n'
16  33  450  '457      popo       not_README.MD'



------ ANALYSER OF THE TEXT ------

commit_value == a8c11fcee68881dfb86095aa36290fb304047cf1
logsize == 110
dictionary :
{'Date': 'Tue, 10 Apr 2012 11:19:44 +0300', 'Author': 'XXXXXX XXXXXXXX <[email protected]>'}
actual_log_message == 'First commit'


commit_value == 12458777AFDRE1254
logsize == 170
dictionary :
{'Date': 'Yesterday 21:45:01 +0800', 'A key with whitespace': 'A_stupid_value', 'Author': 'Jim Bluefish <[email protected]>'}
actual_log_message == 'Funny commit\n  From far from you'
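To see what the key/value pattern does on its own, here is a small sketch that applies the same expression as regx2 to a few untidy lines. The sample lines and the address jim@example.com are made up for the illustration:

```python
import re

# Same pattern as regx2, written as a raw string.
# (?<! ) refuses a match that would end the key or the value on a space,
# so the surrounding padding is left out of both captures.
pair = re.compile(r'^ *(.+?)(?<! ) *: *(.+)(?<! ) *\r?\n', re.MULTILINE)

sample = ('   Author: Jim Bluefish <jim@example.com>\n'
          'Date   :   Yesterday 21:45:01 +0800\n'
          '  A key with whitespace :       A_stupid_value    \n')

fields = dict(pair.findall(sample))
for key in sorted(fields):
    print('%r -> %r' % (key, fields[key]))
```

Because the key group is lazy, it stops at the first colon, so the colons inside the time of day stay in the value.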

Upvotes: 2

NPE

Reputation: 500773

I would do it in three stages:

  1. Use a regex to find each commit, and get its id and log size.
  2. Using the end of the match in step 1 and the log size, I'd slice the metadata+message out of the string.
  3. Parse the string from step 2 into a dictionary+message.

The first two steps can be done as follows:

In [24]: import re

In [25]: s = """commit a8c11fcee68881dfb86095aa36290fb304047cf1
log size 110
Author: XXXXXX XXXXXXXX <[email protected]>
Date:   Tue, 10 Apr 2012 11:19:44 +0300

    First commit
3       0       README.MD
"""

In [26]: m = re.search('commit (.*)\nlog size (.*)\n', s)

In [27]: s[m.end():m.end()+int(m.group(2))]
Out[27]: 'Author: XXXXXX XXXXXXXX <[email protected]>\nDate:   Tue, 10 Apr 2012 11:19:44 +0300\n\n    First commit\n'

If the last string is called step2, you can do the rest of the parsing as follows:

In [48]: meta, msg = step2.split('\n\n', 1)

In [49]: dict([map(str.strip, line.split(':', 1)) for line in meta.split('\n')])
Out[49]: 
{'Author': 'XXXXXX XXXXXXXX <[email protected]>',
 'Date': 'Tue, 10 Apr 2012 11:19:44 +0300'}

In [50]: msg
Out[50]: '    First commit\n'
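The three stages can also be wrapped in one function that walks over every entry in a longer log. This is only a sketch under some assumptions: each entry starts with exactly the two header lines shown above, the log size counts characters (which matches a byte count only for ASCII text), and the metadata is separated from the message by one blank line. The names HEADER and parse_commits, and the sample data, are made up for the illustration.

```python
import re

# Assumption: an entry always begins with these two lines.
HEADER = re.compile(r'^commit (\S+)\nlog size (\d+)\n', re.MULTILINE)


def parse_commits(text):
    """Return a list of (commit_id, metadata_dict, message) tuples."""
    entries = []
    pos = 0
    while True:
        m = HEADER.search(text, pos)          # stage 1: id and log size
        if m is None:
            break
        size = int(m.group(2))
        body = text[m.end():m.end() + size]   # stage 2: slice by log size
        meta, _, msg = body.partition('\n\n') # stage 3: fields, then message
        fields = {}
        for line in meta.split('\n'):
            key, _, value = line.partition(':')
            fields[key.strip()] = value.strip()
        entries.append((m.group(1), fields, msg.strip()))
        pos = m.end() + size                  # resume after the sliced body
    return entries


body = ('Author: Jane Doe <jane@example.com>\n'
        'Date:   Tue, 10 Apr 2012 11:19:44 +0300\n'
        '\n'
        '    First commit\n')
sample = 'commit a8c11fce\nlog size {}\n{}3       0       README.MD\n'.format(
    len(body), body)

for commit_id, fields, message in parse_commits(sample):
    print(commit_id, fields, repr(message))
```

Resuming the search after the sliced body, rather than scanning the whole text for "commit", keeps a message that happens to contain a line starting with "commit" from being mistaken for a new entry.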

Upvotes: 2
