Giuseppe Cianci

Reputation: 537

Editing pyparsing parse results

This is similar to a question I've asked before.

I have written a pyparsing grammar logparser for a text file which contains multiple logs. A log documents every function call and every function completion. The underlying process is multithreaded, so it is possible that a slow function A is called, then a fast function B is called and finishes almost immediately, and after that function A finishes and gives us its return value. Due to this, the log file is very difficult to read by hand because the call information and return value information of one function can be thousands of lines apart.

My parser is able to parse the function calls (from now on called input_blocks) and their return values (from now on called output_blocks). My parse results (logparser.searchString(logfile)) look like this:

[0]:                            # first log
  - input_blocks:
    [0]:
      - func_name: 'Foo'
      - parameters: ...
      - thread: '123'
      - timestamp_in: '12:01'
    [1]:
      - func_name: 'Bar'
      - parameters: ...
      - thread: '456'
      - timestamp_in: '12:02'
  - output_blocks:
    [0]:
      - func_name: 'Bar'
      - func_time: '1'
      - parameters: ...
      - thread: '456'
      - timestamp_out: '12:03'
    [1]:
      - func_name: 'Foo'
      - func_time: '3'
      - parameters: ...
      - thread: '123'
      - timestamp_out: '12:04'
[1]:                            # second log
    - input_blocks:
    ...

    - output_blocks:
    ...
...                             # n-th log

I want to solve the problem that input and output information of one function call are separated. So I want to put an input_block and the corresponding output_block into a function_block. My final parse results should look like this:

[0]:                            # first log
  - function_blocks:
    [0]:
        - input_block:
            - func_name: 'Foo'
            - parameters: ...
            - thread: '123'
            - timestamp_in: '12:01'
        - output_block:
            - func_name: 'Foo'
            - func_time: '3'
            - parameters: ...
            - thread: '123'
            - timestamp_out: '12:04'
    [1]:
        - input_block:
            - func_name: 'Bar'
            - parameters: ...
            - thread: '456'
            - timestamp_in: '12:02'
        - output_block:
            - func_name: 'Bar'
            - func_time: '1'
            - parameters: ...
            - thread: '456'
            - timestamp_out: '12:03'
[1]:                            # second log
    - function_blocks:
    [0]: ...
    [1]: ...
...                             # n-th log

To achieve this, I define a function rearrange which iterates through input_blocks and output_blocks and checks whether func_name, thread, and the timestamps match. However, moving the matching blocks into one function_block is the part I am missing. I then set this function as parse action for the log grammar: logparser.setParseAction(rearrange)

def rearrange(log_token):
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                and output_block.thread == input_block.thread
                and check_timestamp(output_block.timestamp_out,
                                    output_block.func_time,
                                    input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                # modify log_token (this is the part I am missing)
                pass
    return log_token
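
check_timestamp is not shown in the question. A minimal sketch of what such a helper might look like, assuming timestamps are 'HH:MM' strings and func_time is a whole number of minutes (both assumptions are read off the sample output above, where 12:01 plus 3 gives 12:04):

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M"  # assumed format, based on sample values such as '12:01'


def check_timestamp(timestamp_out, func_time, timestamp_in):
    """Return True if timestamp_in + func_time equals timestamp_out.

    Hypothetical helper: assumes 'HH:MM' strings and func_time in minutes,
    and ignores calls that run past midnight.
    """
    started = datetime.strptime(timestamp_in, TIME_FORMAT)
    finished = datetime.strptime(timestamp_out, TIME_FORMAT)
    return started + timedelta(minutes=int(func_time)) == finished


print(check_timestamp("12:04", "3", "12:01"))  # Foo from the sample log -> True
print(check_timestamp("12:03", "1", "12:02"))  # Bar from the sample log -> True
print(check_timestamp("12:03", "3", "12:01"))  # durations do not add up -> False
```

The real log format may carry seconds or dates, in which case TIME_FORMAT and the timedelta unit would need to change accordingly.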

My question is: How do I put the matching output_block and input_block in a function_block in a way that I still enjoy the easy access methods of pyparsing.ParseResults?

My idea looks like this:

def rearrange(log_token):
    # define a new ParseResults object in which I store matching input & output blocks
    function_blocks = pp.ParseResults(name='function_blocks')

    # find matching blocks
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                and output_block.thread == input_block.thread
                and check_timestamp(output_block.timestamp_out,
                                    output_block.func_time,
                                    input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                function_blocks.append(input_block.pop() + output_block.pop())  # this addition causes a maximum recursion error?
    log_token.append(function_blocks)
    return log_token

This doesn't work, though. The addition causes a maximum recursion error, and .pop() doesn't work as I expected: it doesn't pop the whole block, only the last entry in that block. It also doesn't fully remove that entry; it just removes it from the list, and the entry is still accessible by its results name.

It's also possible that some of the input_blocks don't have a corresponding output_block (for example, if the process crashes before all functions can finish). So my parse results should have the attributes input_blocks and output_blocks (for the spare blocks) and function_blocks (for the matching blocks).
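
The requirement above (matched pairs plus spare blocks) can be sketched on plain data first. This is only an illustration: plain dicts stand in for the parsed blocks, pair_blocks and the key (func_name, thread) are assumptions, 'Baz' is a made-up call that never returns, and the timestamp check is left out for brevity. Each output is consumed at most once, so repeated calls of the same function on the same thread pair up in order:

```python
from collections import defaultdict, deque


def pair_blocks(input_blocks, output_blocks):
    """Pair each input with the earliest unused output of the same call.

    Hypothetical helper operating on plain dicts, not on ParseResults.
    Returns (function_blocks, spare_inputs, spare_outputs).
    """
    # Queue the outputs per (func_name, thread) so each is used only once.
    pending = defaultdict(deque)
    for out in output_blocks:
        pending[(out["func_name"], out["thread"])].append(out)

    function_blocks, spare_inputs = [], []
    for inp in input_blocks:
        queue = pending[(inp["func_name"], inp["thread"])]
        if queue:
            function_blocks.append({"input_block": inp, "output_block": queue.popleft()})
        else:
            spare_inputs.append(inp)  # e.g. the process crashed before returning

    # Whatever is still queued never found a matching input.
    spare_outputs = [out for queue in pending.values() for out in queue]
    return function_blocks, spare_inputs, spare_outputs


inputs = [
    {"func_name": "Foo", "thread": "123", "timestamp_in": "12:01"},
    {"func_name": "Bar", "thread": "456", "timestamp_in": "12:02"},
    {"func_name": "Baz", "thread": "123", "timestamp_in": "12:05"},  # never returns
]
outputs = [
    {"func_name": "Bar", "thread": "456", "timestamp_out": "12:03"},
    {"func_name": "Foo", "thread": "123", "timestamp_out": "12:04"},
]

matched, spare_in, spare_out = pair_blocks(inputs, outputs)
print([fb["input_block"]["func_name"] for fb in matched])  # ['Foo', 'Bar']
print([blk["func_name"] for blk in spare_in])              # ['Baz']
print(spare_out)                                           # []
```

Compared with the nested loops, this makes a single pass over each list, which may matter when one log holds thousands of blocks.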

Thanks for your help!

EDIT:

I made a simpler example to show my problem. I also experimented and found a solution which kind of works but is a bit messy. I must admit there was a lot of trial and error involved, because I could neither find documentation on the inner workings of ParseResults nor work out for myself how to properly create my own nested ParseResults structure.

from pyparsing import *

def main():
    log_data = '''\
    Func1_in
    Func2_in
    Func2_out
    Func1_out
    Func3_in'''

    ParserElement.inlineLiteralsUsing(Suppress)
    input_block = Group(Word(alphanums)('func_name') + '_in').setResultsName('input_blocks', listAllMatches=True)
    output_block = Group(Word(alphanums)('func_name') +'_out').setResultsName('output_blocks', listAllMatches=True)
    log = OneOrMore(input_block | output_block)

    parse_results = log.parseString(log_data)
    print('***** before rearranging *****')
    print(parse_results.dump())

    parse_results = rearrange(parse_results)
    print('***** after rearranging *****')
    print(parse_results.dump())

def rearrange(log_token):
    function_blocks = list()

    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and delete them from their original positions in log_token
                # I have to do both __setitem__ and .append so it shows up in the dict and in the list
                # and .copy() is necessary because I delete the original objects later
                tmp_function_block = ParseResults()
                tmp_function_block.__setitem__('input', input_block.copy())
                tmp_function_block.append(input_block.copy())
                tmp_function_block.__setitem__('output', output_block.copy())
                tmp_function_block.append(output_block.copy())
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output']  # remove duplicate data

                function_blocks.append(function_block)
                # delete from original position in log_token
                input_block.clear()
                output_block.clear()
    log_token.__setitem__('function_blocks', sum(function_blocks))
    return log_token


if __name__ == '__main__':
    main()

Output:

***** before rearranging *****
[['Func1'], ['Func2'], ['Func2'], ['Func1'], ['Func3']]
- input_blocks: [['Func1'], ['Func2'], ['Func3']]
  [0]:
    ['Func1']
    - func_name: 'Func1'
  [1]:
    ['Func2']
    - func_name: 'Func2'
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [['Func2'], ['Func1']]
  [0]:
    ['Func2']
    - func_name: 'Func2'
  [1]:
    ['Func1']
    - func_name: 'Func1'
***** after rearranging *****
[[], [], [], [], ['Func3']]
- function_blocks: [['Func1'], ['Func1'], ['Func2'], ['Func2'], [], []]   # why is this duplicated? I just want the inner function_blocks!
  - function_blocks: [[['Func1'], ['Func1']], [['Func2'], ['Func2']], [[], []]]
    [0]:
      [['Func1'], ['Func1']]
      - input: ['Func1']
        - func_name: 'Func1'
      - output: ['Func1']
        - func_name: 'Func1'
    [1]:
      [['Func2'], ['Func2']]
      - input: ['Func2']
        - func_name: 'Func2'
      - output: ['Func2']
        - func_name: 'Func2'
    [2]:                              # where does this come from?
      [[], []]
      - input: []
      - output: []
- input_blocks: [[], [], ['Func3']]
  [0]:                                # how do I delete these indexes?
    []                                #  I think I only cleared their contents
  [1]:
    []
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [[], []]
  [0]:
    []
  [1]:
    []

Upvotes: 3

Views: 883

Answers (2)

milahu

Reputation: 3529

Another demo of pyparsing's setParseAction:
remove whitespace before the first value, but preserve whitespace between values.

I tried to solve this with pp.Optional(pp.White(' \t')).suppress(),
but then I got a = ["b=1"] (the parser did not stop at end-of-line).

import pyparsing as pp

def lstrip_first_value(src, loc, token):
    "remove whitespace before first value"
    # based on https://stackoverflow.com/a/51335710/10440128

    if token == []:
        return token

    # update the values
    copy = token[:]
    copy[0] = copy[0].lstrip()
    if copy[0] == "" and len(copy) > 1:
        copy = copy[1:]

    # update the token
    token.clear()
    token.extend(copy)
    token["value"] = copy
    return token

# Value must be defined before Values, which refers to it
Value = pp.Combine(
    pp.QuotedString(quoteChar='"', escChar="\\")
    | pp.White(' \t') # parse whitespace to separate token
)

Values = (
    pp.OneOrMore(Value.leaveWhitespace())
    | pp.Empty().setParseAction(pp.replaceWith(""))
)("value").setParseAction(lstrip_first_value)

inputs

a=
b=2
a =  
b=2

The values of a should always be [""].
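
The list handling inside lstrip_first_value can be tried without a parser. The following is a plain-list restatement for illustration only; lstrip_first is a hypothetical name, and it works on ordinary lists of strings rather than on ParseResults:

```python
def lstrip_first(values):
    """Strip leading whitespace from the first value; keep the rest untouched.

    Hypothetical plain-list version of lstrip_first_value, for illustration.
    """
    if not values:
        return values
    result = list(values)
    result[0] = result[0].lstrip()
    # A first token that was pure whitespace is dropped if other values follow.
    if result[0] == "" and len(result) > 1:
        result = result[1:]
    return result


print(lstrip_first([""]))                    # ['']
print(lstrip_first(["  "]))                  # ['']
print(lstrip_first(["  ", "x", " ", "y"]))   # ['x', ' ', 'y']
```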

Upvotes: 0

PaulMcG

Reputation: 63709

This version of rearrange addresses most of the issues I see in your example:

def rearrange(log_token):
    function_blocks = list()

    for input_block in log_token.input_blocks:
        # look for match among output blocks that have not been cleared
        for output_block in filter(None, log_token.output_blocks):

            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and clear them in their original positions in log_token

                # create rearranged block, first with a list of the two blocks
                # instead of append()'ing, just initialize with a list containing
                # the two block copies
                tmp_function_block = ParseResults([input_block.copy(), output_block.copy()])

                # now assign the blocks by name
                # x.__setitem__(key, value) is the same as x[key] = value
                tmp_function_block['input'] = tmp_function_block[0]
                tmp_function_block['output'] = tmp_function_block[1]

                # wrap that all in another ParseResults, as if we had matched a Group
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output

                del function_block['input'], function_block['output']  # remove duplicate name references

                function_blocks.append(function_block)
                # clear blocks in their original positions in log_token, so they won't be matched any more
                input_block.clear()
                output_block.clear()

                # match found, no need to keep going looking for a matching output block 
                break

    # find all input blocks that weren't cleared (i.e. had no matching output block) and append them as input-only blocks
    for input_block in filter(None, log_token.input_blocks):
        # no matching output for this input
        tmp_function_block = ParseResults([input_block.copy()])
        tmp_function_block['input'] = tmp_function_block[0]
        function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                      modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
        del function_block['input']  # remove duplicate data
        function_blocks.append(function_block)
        input_block.clear()

    # clean out log_token, and reload with rearranged function blocks
    log_token.clear()
    log_token.extend(function_blocks)
    log_token['function_blocks'] = sum(function_blocks)

    return log_token

And since this takes the input token and returns the rearranged tokens, you can make it a parse action as-is:

    # trailing '*' on the results name is equivalent to listAllMatches=True
    input_block = Group(Word(alphanums)('func_name') + '_in')('input_blocks*')
    output_block = Group(Word(alphanums)('func_name') +'_out')('output_blocks*')
    log = OneOrMore(input_block | output_block)
    log.addParseAction(rearrange)
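
The trailing '*' shorthand mentioned in the comment above can be checked with a small sketch. It reuses the block grammar from the question with three variants that differ only in how the results name is set; the variable names explicit, shorthand and last_only are made up for this illustration:

```python
import pyparsing as pp

name = pp.Word(pp.alphanums)("func_name")

# Three variants of the same block; only the results-name handling differs.
explicit = pp.Group(name + pp.Suppress("_in")).setResultsName("input_blocks", listAllMatches=True)
shorthand = pp.Group(name + pp.Suppress("_in"))("input_blocks*")
last_only = pp.Group(name + pp.Suppress("_in"))("input_blocks")

text = "Func1_in Func2_in"
results = {}
for label, block in [("explicit", explicit), ("shorthand", shorthand), ("last_only", last_only)]:
    parsed = pp.OneOrMore(block).parseString(text)
    results[label] = parsed.input_blocks.asList()
    print(label, results[label])

# explicit  [['Func1'], ['Func2']]
# shorthand [['Func1'], ['Func2']]
# last_only ['Func2']
```

Without the '*' (or listAllMatches=True), the name refers only to the last matching block, which is why the nested loops in rearrange need the accumulating form.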

Since rearrange updates log_token in place, the final return statement is unnecessary when you use it as a parse action.

It is interesting how you were able to update the list in-place by clearing those blocks that you had found matches for - very clever.

Generally, the assembly of tokens into ParseResults is an internal function, so the docs are light on this topic. I was just looking through the module docs and I don't really see a good home for this topic.

Upvotes: 2
