Gerard

Reputation: 11

Skipping a token-defined number of tokens in Python PLY

So I've got a language that's a bytestring representing a list of the following header+data combos (e.g. headerdataheaderdataheaderdata...):

Header

Data

Tokens

Just one token per byte value:

b00 = r'\x00'
...
bFF = r'\xFF'

Grammar

file      -> segments
segments  -> segment segments
           | segment
segment   -> delim id timestamp type group_id owner datalen checksum
delim     -> bFF bAA
id        -> int32
timestamp -> int32
type      -> int32
group_id  -> int16
owner     -> int32
datalen   -> int32
checksum  -> int32
int32     -> byte byte byte byte
int16     -> byte byte
byte      -> <oh my god one rule per value token>

The problem

I know this isn't the typical context-free language you'd normally work with in PLY. The length of each segment depends on a number contained within it. However, it's easy to get at that number with an embedded action in the 'segment' rule:

def p_segment(t):
    ''' segment : delim id timestamp type group_id owner datalen checksum'''
    id = t[2]
    timestamp = t[3]
    type = t[4]
    group_id = t[5]
    owner = t[6]
    datalen = t[7]
    checksum = t[8]
    t[0] = (id,timestamp,type,group_id,owner,datalen,checksum)
    # Assume all rules for t[2:8] return the correct data type haha

Now my thought was to just accumulate the extra bytes and store them somewhere with lexer.token():

def p_segment(t):
    ''' segment : delim id timestamp type group_id owner datalen checksum'''
    id = t[2]
    timestamp = t[3]
    type = t[4]
    group_id = t[5]
    owner = t[6]
    datalen = t[7]
    checksum = t[8]

    data = []
    for i in range(datalen):
        data.append(t.lexer.token())

    t[0] = (id,timestamp,type,group_id,owner,datalen,checksum,data)

This works to an extent: data does have the data in it, and t.lexer.lexpos is updated. However, the parser loses its marbles with a syntax error right after the last byte of the header. This seems to imply that while the lexer is getting advanced along the string, the parser isn't. How can I fix that? Should I abandon PLY altogether? (And if so, what's a suitable alternative?)

Also, I've tried adding a rule for the data, but just adding a 'segment_data' rule doesn't really work, as there's no delimiter or other context-free length to faithfully rely on:

def p_segment_data(t):
    ''' 
    segment_data : byte segment_data
                 | byte
    '''
    if len(t) > 2:
        t[0] = [t[1]] + t[2] # we want to return a list of bytes
    else:
        t[0] = [t[1]]

This in practice generates a list of bytes, but it simply munches ALL of the remaining data after the first segment header.

Upvotes: 1

Views: 524

Answers (1)

rici

Reputation: 241781

Unless there is something more to your language, a context-free parser is really not an appropriate tool. You probably could force the square peg into the round hole, but why bother?

Fixed-length headers can be broken apart really easily using struct.unpack_from, and once you have the length of the payload, you can extract it with an ordinary Python slice. Presumably, you would then verify the checksum before attempting to do anything with the bytestring.
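As a minimal sketch of that approach, assuming the header layout from the question (2-byte delimiter, then six int32 fields and one int16 field) and assuming big-endian byte order, which the question doesn't specify:

```python
import struct

# Header: delim (2 bytes), id, timestamp, type (int32), group_id (int16),
# owner, datalen, checksum (int32). '>' = big-endian; change to '<' if
# the format is actually little-endian.
HEADER = struct.Struct('>2sIIIHIII')

def parse_segments(blob):
    """Split a bytestring into (id, timestamp, type, group_id, owner,
    datalen, checksum, data) tuples, one per segment."""
    segments = []
    offset = 0
    while offset < len(blob):
        (delim, seg_id, timestamp, seg_type,
         group_id, owner, datalen, checksum) = HEADER.unpack_from(blob, offset)
        if delim != b'\xff\xaa':
            raise ValueError('bad delimiter at offset %d' % offset)
        offset += HEADER.size
        # The payload is just a slice of known length -- no parsing needed.
        data = blob[offset:offset + datalen]
        offset += datalen
        segments.append((seg_id, timestamp, seg_type, group_id,
                         owner, datalen, checksum, data))
    return segments
```

The names (`parse_segments`, `HEADER`) and the endianness are my own choices for illustration; the field widths are taken from the grammar in the question.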

If the payload contains content in a language described by a context-free grammar, you could then use a Ply-based parser on just that extracted string.

Upvotes: 0
