Reputation: 537
This is similar to a question I've asked before.
I have written a pyparsing grammar logparser for a text file which contains multiple logs. A log documents every function call and every function completion. The underlying process is multithreaded, so it is possible that a slow function A is called, then a fast function B is called and finishes almost immediately, and after that function A finishes and gives us its return value. Due to this, the log file is very difficult to read by hand, because the call information and return value information of one function can be thousands of lines apart.
My parser is able to parse the function calls (from now on called input_blocks) and their return values (from now on called output_blocks). My parse results (logparser.searchString(logfile)) look like this:
[0]: # first log
- input_blocks:
  [0]:
    - func_name: 'Foo'
    - parameters: ...
    - thread: '123'
    - timestamp_in: '12:01'
  [1]:
    - func_name: 'Bar'
    - parameters: ...
    - thread: '456'
    - timestamp_in: '12:02'
- output_blocks:
  [0]:
    - func_name: 'Bar'
    - func_time: '1'
    - parameters: ...
    - thread: '456'
    - timestamp_out: '12:03'
  [1]:
    - func_name: 'Foo'
    - func_time: '3'
    - parameters: ...
    - thread: '123'
    - timestamp_out: '12:04'
[1]: # second log
- input_blocks:
  ...
- output_blocks:
  ...
... # n-th log
I want to solve the problem that the input and output information of one function call are separated. So I want to put an input_block and the corresponding output_block into a function_block. My final parse results should look like this:
[0]: # first log
- function_blocks:
  [0]:
    - input_block:
      - func_name: 'Foo'
      - parameters: ...
      - thread: '123'
      - timestamp_in: '12:01'
    - output_block:
      - func_name: 'Foo'
      - func_time: '3'
      - parameters: ...
      - thread: '123'
      - timestamp_out: '12:04'
  [1]:
    - input_block:
      - func_name: 'Bar'
      - parameters: ...
      - thread: '456'
      - timestamp_in: '12:02'
    - output_block:
      - func_name: 'Bar'
      - func_time: '1'
      - parameters: ...
      - thread: '456'
      - timestamp_out: '12:03'
[1]: # second log
- function_blocks:
  [0]: ...
  [1]: ...
... # n-th log
To achieve this, I define a function rearrange which iterates through input_blocks and output_blocks and checks whether func_name, thread, and the timestamps match. However, moving the matching blocks into one function_block is the part I am missing. I then set this function as the parse action for the log grammar: logparser.setParseAction(rearrange)
def rearrange(log_token):
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                    and output_block.thread == input_block.thread
                    and check_timestamp(output_block.timestamp_out,
                                        output_block.func_time,
                                        input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                # modify log_token
                pass
    return log_token
My question is: How do I put the matching output_block and input_block into a function_block in a way that I still enjoy the easy access methods of pyparsing.ParseResults?
My idea looks like this:
def rearrange(log_token):
    # define a new ParseResults object in which I store matching input & output blocks
    function_blocks = pp.ParseResults(name='function_blocks')
    # find matching blocks
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                    and output_block.thread == input_block.thread
                    and check_timestamp(output_block.timestamp_out,
                                        output_block.func_time,
                                        input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                function_blocks.append(input_block.pop() + output_block.pop())  # this addition causes a maximum recursion error?
    log_token.append(function_blocks)
    return log_token
This doesn't work though. The addition causes a maximum recursion error, and the .pop() doesn't work as expected: it doesn't pop the whole block, it just pops the last entry in that block. Also, it doesn't actually remove that entry either; it just removes it from the list, but it's still accessible by its results name.
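For illustration, here is a minimal sketch of that pop() behavior (the grammar below is a made-up stand-in for my real logparser): pop() removes only the last entry from the plain list part of the ParseResults, while the results name still sees the popped value.

```python
import pyparsing as pp

# two named, grouped matches, collected with listAllMatches (trailing '*')
block = pp.Group(pp.Word(pp.alphas)('func_name'))('blocks*')
res = pp.OneOrMore(block).parseString("Foo Bar")

popped = res.pop()              # pops only the LAST entry of the list part
assert popped.asList() == ['Bar']
assert len(res) == 1            # the list part shrank ...
assert len(res['blocks']) == 2  # ... but the results name still sees both groups
```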
It's also possible that some of the input_blocks don't have a corresponding output_block (for example if the process crashes before all functions can finish). So my parse results should have the attributes input_blocks, output_blocks (for the spare blocks), and function_blocks (for the matching blocks).
Thanks for your help!
EDIT:
I made a simpler example to show my problem. Also, I experimented around and have a solution which kind of works, but is a bit messy. I must admit there was a lot of trial and error involved, because I neither found documentation on, nor can make sense of, the inner workings of ParseResults and how to properly create my own nested ParseResults structure.
from pyparsing import *

def main():
    log_data = '''\
Func1_in
Func2_in
Func2_out
Func1_out
Func3_in'''

    ParserElement.inlineLiteralsUsing(Suppress)
    input_block = Group(Word(alphanums)('func_name') + '_in').setResultsName('input_blocks', listAllMatches=True)
    output_block = Group(Word(alphanums)('func_name') + '_out').setResultsName('output_blocks', listAllMatches=True)
    log = OneOrMore(input_block | output_block)

    parse_results = log.parseString(log_data)
    print('***** before rearranging *****')
    print(parse_results.dump())
    parse_results = rearrange(parse_results)
    print('***** after rearranging *****')
    print(parse_results.dump())

def rearrange(log_token):
    function_blocks = list()
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and delete them from their original positions in log_token
                # I have to do both __setitem__ and .append so it shows up in the dict and in the list
                # and .copy() is necessary because I delete the original objects later
                tmp_function_block = ParseResults()
                tmp_function_block.__setitem__('input', input_block.copy())
                tmp_function_block.append(input_block.copy())
                tmp_function_block.__setitem__('output', output_block.copy())
                tmp_function_block.append(output_block.copy())
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block,
                                              asList=True, modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output']  # remove duplicate data
                function_blocks.append(function_block)
                # delete from original position in log_token
                input_block.clear()
                output_block.clear()
    log_token.__setitem__('function_blocks', sum(function_blocks))
    return log_token

if __name__ == '__main__':
    main()
Output:
***** before rearranging *****
[['Func1'], ['Func2'], ['Func2'], ['Func1'], ['Func3']]
- input_blocks: [['Func1'], ['Func2'], ['Func3']]
  [0]:
    ['Func1']
    - func_name: 'Func1'
  [1]:
    ['Func2']
    - func_name: 'Func2'
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [['Func2'], ['Func1']]
  [0]:
    ['Func2']
    - func_name: 'Func2'
  [1]:
    ['Func1']
    - func_name: 'Func1'
***** after rearranging *****
[[], [], [], [], ['Func3']]
- function_blocks: [['Func1'], ['Func1'], ['Func2'], ['Func2'], [], []]  # why is this duplicated? I just want the inner function_blocks!
- function_blocks: [[['Func1'], ['Func1']], [['Func2'], ['Func2']], [[], []]]
  [0]:
    [['Func1'], ['Func1']]
    - input: ['Func1']
      - func_name: 'Func1'
    - output: ['Func1']
      - func_name: 'Func1'
  [1]:
    [['Func2'], ['Func2']]
    - input: ['Func2']
      - func_name: 'Func2'
    - output: ['Func2']
      - func_name: 'Func2'
  [2]:  # where does this come from?
    [[], []]
    - input: []
    - output: []
- input_blocks: [[], [], ['Func3']]
  [0]:  # how do I delete these indexes?
    []   # I think I only cleared their contents
  [1]:
    []
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [[], []]
  [0]:
    []
  [1]:
    []
Upvotes: 3
Views: 883
Reputation: 3529
Another demo for a pyparsing setParseAction: remove whitespace before the first value, but preserve whitespace between values.
I tried to solve this with pp.Optional(pp.White(' \t')).suppress(), but then I got a = ["b=1"] (the parser did not stop at end-of-line).
def lstrip_first_value(src, loc, token):
    "remove whitespace before the first value"
    # based on https://stackoverflow.com/a/51335710/10440128
    if token == []:
        return token
    # update the values
    copy = token[:]
    copy[0] = copy[0].lstrip()
    if copy[0] == "" and len(copy) > 1:
        copy = copy[1:]
    # update the token
    token.clear()
    token.extend(copy)
    token["value"] = copy
    return token

# Value must be defined before Values, which uses it
Value = pp.Combine(
    pp.QuotedString(quoteChar='"', escChar="\\")
    | pp.White(' \t')  # parse whitespace to separate tokens
)

Values = (
    pp.OneOrMore(Value.leaveWhitespace())
    | pp.Empty().setParseAction(pp.replaceWith(""))
)("value").setParseAction(lstrip_first_value)
Inputs:

a=
b=2
a =
b=2

The values of a should always be [""].
Upvotes: 0
Reputation: 63709
This version of rearrange addresses most of the issues I see in your example:
def rearrange(log_token):
    function_blocks = list()
    for input_block in log_token.input_blocks:
        # look for a match among output blocks that have not been cleared
        for output_block in filter(None, log_token.output_blocks):
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and clear them from their original positions in log_token

                # create the rearranged block: instead of append()'ing, just
                # initialize with a list containing the two block copies
                tmp_function_block = ParseResults([input_block.copy(), output_block.copy()])
                # now assign the blocks by name
                # x.__setitem__(key, value) is the same as x[key] = value
                tmp_function_block['input'] = tmp_function_block[0]
                tmp_function_block['output'] = tmp_function_block[1]

                # wrap that all in another ParseResults, as if we had matched a Group
                # (asList=True wraps the tokens as a sublist, like Group;
                # modal=False is the equivalent of listAllMatches=True)
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block,
                                              asList=True, modal=False)
                del function_block['input'], function_block['output']  # remove duplicate name references
                function_blocks.append(function_block)

                # clear the blocks in their original positions in log_token,
                # so they won't be matched any more
                input_block.clear()
                output_block.clear()

                # match found, no need to keep looking for a matching output block
                break

    # find all input blocks that weren't cleared (i.e. had no matching output
    # block) and append them as input-only blocks
    for input_block in filter(None, log_token.input_blocks):
        # no matching output for this input
        tmp_function_block = ParseResults([input_block.copy()])
        tmp_function_block['input'] = tmp_function_block[0]
        function_block = ParseResults(name='function_blocks', toklist=tmp_function_block,
                                      asList=True, modal=False)
        del function_block['input']  # remove duplicate name reference
        function_blocks.append(function_block)
        input_block.clear()

    # clean out log_token, and reload with the rearranged function blocks
    log_token.clear()
    log_token.extend(function_blocks)
    log_token['function_blocks'] = sum(function_blocks)
    return log_token
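A side note on the sum(function_blocks) call at the end: it works because ParseResults supports + and also defines __radd__ to handle sum()'s implicit 0 start value. A minimal check (the names here are illustrative only):

```python
import pyparsing as pp

# ParseResults defines __radd__ so that 0 + ParseResults returns a copy,
# which is what lets plain sum() concatenate a list of ParseResults
a = pp.ParseResults(['x'])
b = pp.ParseResults(['y'])
combined = sum([a, b])
assert combined.asList() == ['x', 'y']
```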
And since this takes the input token and returns the rearranged tokens, you can make it a parse action as-is:
# trailing '*' on the results name is equivalent to listAllMatches=True
input_block = Group(Word(alphanums)('func_name') + '_in')('input_blocks*')
output_block = Group(Word(alphanums)('func_name') +'_out')('output_blocks*')
log = OneOrMore(input_block | output_block)
log.addParseAction(rearrange)
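As a quick, self-contained check of that shorthand (using placeholder names, not your log grammar), the trailing '*' and the explicit listAllMatches=True form produce the same result:

```python
import pyparsing as pp

word = pp.Group(pp.Word(pp.alphas))
# trailing '*' on the results name ...
starred = pp.OneOrMore(word('items*'))
# ... is equivalent to listAllMatches=True
explicit = pp.OneOrMore(word.setResultsName('items', listAllMatches=True))

r1 = starred.parseString("a b c")
r2 = explicit.parseString("a b c")
assert r1['items'].asList() == r2['items'].asList() == [['a'], ['b'], ['c']]
```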
Since rearrange updates log_token in place, if you make it a parse action, the ending return statement is unnecessary.
It is interesting how you were able to update the list in-place by clearing those blocks that you had found matches for - very clever.
Generally, the assembly of tokens into ParseResults is an internal function, so the docs are light on this topic. I was just looking through the module docs, and I don't really see a good home for this topic.
Upvotes: 2