user193661

Reputation: 939

Get correct brace grouping from string

I have files with incorrect JSON that I want to start fixing by getting it into properly grouped chunks.

The brace grouping {{ {} {} } } {{}} {{{}}} should already be correct.

How can I grab all the top-level braces, correctly grouped, as separate strings?

Upvotes: 0

Views: 50

Answers (3)

mgilson

Reputation: 310049

pyparsing can be really helpful here. It handles pathological cases such as braces inside strings. Doing all of this work yourself can be tricky, but fortunately the author of the library has already done the hard part for us. I'll reproduce the code here to prevent link rot:

# jsonParser.py
#
# Implementation of a simple JSON parser, returning a hierarchical
# ParseResults object support both list- and dict-style data access.
#
# Copyright 2006, by Paul McGuire
#
# Updated 8 Jan 2007 - fixed dict grouping bug, and made elements and
#   members optional in array and object collections
#
json_bnf = """
object
    { members }
    {}
members
    string : value
    members , string : value
array
    [ elements ]
    []
elements
    value
    elements , value
value
    string
    number
    object
    array
    true
    false
    null
"""
from pyparsing import *

TRUE = Keyword("true").setParseAction( replaceWith(True) )
FALSE = Keyword("false").setParseAction( replaceWith(False) )
NULL = Keyword("null").setParseAction( replaceWith(None) )

jsonString = dblQuotedString.setParseAction( removeQuotes )
jsonNumber = Combine( Optional('-') + ( '0' | Word('123456789',nums) ) +
                    Optional( '.' + Word(nums) ) +
                    Optional( Word('eE',exact=1) + Word(nums+'+-',nums) ) )

jsonObject = Forward()
jsonValue = Forward()
jsonElements = delimitedList( jsonValue )
jsonArray = Group(Suppress('[') + Optional(jsonElements) + Suppress(']') )
jsonValue << ( jsonString | jsonNumber | Group(jsonObject)  | jsonArray | TRUE | FALSE | NULL )
memberDef = Group( jsonString + Suppress(':') + jsonValue )
jsonMembers = delimitedList( memberDef )
jsonObject << Dict( Suppress('{') + Optional(jsonMembers) + Suppress('}') )

jsonComment = cppStyleComment
jsonObject.ignore( jsonComment )

def convertNumbers(s,l,toks):
    n = toks[0]
    try:
        return int(n)
    except ValueError:
        return float(n)

jsonNumber.setParseAction( convertNumbers )

Phew! That's a lot. Now how do we use it? The general strategy is to scan the string for matches and then slice those matches out of the original string. Each scan result is a tuple of the form (lex-tokens, start_index, stop_index). For our purposes we don't care about the lex-tokens, just the start and stop. We could write string[result[1]:result[2]], or equivalently string[slice(*result[1:])]. Take your pick.

results = jsonObject.scanString(testdata)
for result in results:
    print('*' * 80)
    print(testdata[slice(*result[1:])])
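
If you want to convince yourself that the two slicing spellings are the same without installing pyparsing, here is a minimal sketch. The sample string and the result tuples are made up by me to mimic the (lex-tokens, start_index, stop_index) shape; they are not real scanString output.

```python
# Hypothetical input with two top-level objects surrounded by junk.
testdata = 'junk {"a": {"b": 1}} more junk {"c": 2}'

# Hand-written stand-ins for scan results: (tokens, start, end).
# The tokens are placeholders (None); only the indices matter here.
results = [
    (None, 5, 20),
    (None, 31, 39),
]

chunks = []
for result in results:
    explicit = testdata[result[1]:result[2]]      # plain start:stop slice
    via_slice = testdata[slice(*result[1:])]      # same slice, built from the tuple tail
    assert explicit == via_slice
    chunks.append(explicit)

print(chunks)
```

Both forms cut the matched text straight out of the original string, so whatever whitespace or formatting was inside the braces is preserved.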

Upvotes: 1

niemmi

Reputation: 17263

If you don't want to install any extra modules, a simple function will do:

def top_level(s):
    depth = 0
    start = -1

    for i, c in enumerate(s):
        if c == '{':
            if depth == 0:
                start = i
            depth += 1
        elif c == '}' and depth:
            depth -= 1
            if depth == 0:
                yield s[start:i+1]

print(list(top_level('{{ {} {} } } {{}} {{{}}}')))

Output:

['{{ {} {} } }', '{{}}', '{{{}}}']

It silently skips unmatched closing braces (and drops a trailing group that is never closed), but it could easily be modified to report an error when they are spotted.
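
One possible way to do that modification is sketched below. The name top_level_strict and the choice of ValueError are my own assumptions, not something the question asked for; like the function above, it does not know about braces inside quoted strings.

```python
def top_level_strict(s):
    """Yield top-level brace groups; raise ValueError on unbalanced braces."""
    depth = 0
    start = -1

    for i, c in enumerate(s):
        if c == '{':
            if depth == 0:
                start = i          # remember where this top-level group begins
            depth += 1
        elif c == '}':
            if depth == 0:
                # A closing brace with nothing open is an error here,
                # instead of being skipped.
                raise ValueError("unmatched '}' at index %d" % i)
            depth -= 1
            if depth == 0:
                yield s[start:i + 1]

    if depth:
        # The input ended while a top-level group was still open.
        raise ValueError("unclosed '{' starting at index %d" % start)


print(list(top_level_strict('{{ {} {} } } {{}} {{{}}}')))
```

Since this is a generator, the error is only raised once iteration reaches the bad brace, so any groups before it will already have been yielded.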

Upvotes: 2

Tim Pietzcker

Reputation: 336408

Using the third-party regex module (the standard library's re module does not support the recursive (?R) construct):

In [1]: import regex

In [2]: braces = regex.compile(r"\{(?:[^{}]++|(?R))*\}")

In [3]: braces.findall("{{ {} {} } } {{}} {{{}}}")
Out[3]: ['{{ {} {} } }', '{{}}', '{{{}}}']

Upvotes: 1
