Chris Maness

Reputation: 1686

Regex for almost JSON but not quite

Hello all, I'm trying to parse a fairly well-formed string into its component pieces. The string is very JSON-like, but it's not strictly JSON. The strings are formed like so:

createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source="Region", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}

The desired output is just the chunks of text below; nothing special has to be done with them at this point.

createdAt=Fri Aug 24 09:48:51 EDT 2012 
id=238996293417062401 
text='Test Test' 
source="Region"
entities=[foo, bar] 
user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}

Using the following expression, I am able to get most of the fields separated out:

,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))

This splits on every comma that is not inside quotes of either type, but I can't seem to make the leap to also skipping commas that are inside brackets or braces.

Upvotes: 2

Views: 275

Answers (2)

Laurence Gonsalves

Reputation: 143104

Because you want to handle nested parens/brackets, the "right" way to handle them is to tokenize them separately, and keep track of your nesting level. So instead of a single regex, you really need multiple regexes for your different token types.

This is Python, but converting to Java shouldn't be too hard.

import re

# just comma
sep_re = re.compile(r',')

# open paren, bracket, or brace (the question's input also uses braces)
inc_re = re.compile(r'[\[({]')

# close paren, bracket, or brace
dec_re = re.compile(r'[)\]}]')

# string literal
# (I was lazy with the escaping. Add other escape sequences, or find an
# "official" regex to use.)
chunk_re = re.compile(r'''"(?:[^"\\]|\\")*"|'(?:[^'\\]|\\')*[']''')

# This class could've been just a generator function, but I couldn't
# find a way to manage the state in the match function that wasn't
# awkward.
class tokenizer:
  def __init__(self):
    self.pos = 0

  def _match(self, regex, s):
    m = regex.match(s, self.pos)
    if m:
      self.pos += len(m.group(0))
      self.token = m.group(0)
    else:
      self.token = ''
    return self.token

  def tokenize(self, s):
    field = '' # the field we're working on
    depth = 0  # how many parens/brackets deep we are
    while self.pos < len(s):
      if not depth and self._match(sep_re, s):
        # In Java, change the "yields" to append to a List, and you'll
        # have something roughly equivalent (but non-lazy).
        yield field
        field = ''
      else:
        if self._match(inc_re, s):
          depth += 1
        elif self._match(dec_re, s):
          depth -= 1
        elif self._match(chunk_re, s):
          pass
        else:
          # everything else we just consume one character at a time
          self.token = s[self.pos]
          self.pos += 1
        field += self.token
    yield field

Usage:

>>> list(tokenizer().tokenize('foo=(3,(5+7),8),bar="hello,world",baz'))
['foo=(3,(5+7),8)', 'bar="hello,world"', 'baz']

This implementation takes a few shortcuts:

  • The string escapes are really lazy: it only supports \" in double quoted strings and \' in single-quoted strings. This is easy to fix.
  • It only keeps track of nesting level. It does not verify that parens are matched up with parens (rather than brackets). If you care about that you can change depth into some sort of stack and push/pop parens/brackets onto it.
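The stack idea from the second bullet could be sketched roughly as follows. This is a minimal illustration, not part of the original answer: the function name split_top_level and the PAIRS table are made up for the example, it treats braces as a third bracket type, and it assumes that every ' or " in the input opens a string literal (so an unquoted apostrophe would confuse it).

```python
# Hypothetical helper: maps each closing bracket to the opener it must match.
PAIRS = {')': '(', ']': '[', '}': '{'}
OPENERS = set(PAIRS.values())

def split_top_level(s, sep=','):
    """Split s on sep, skipping separators inside quotes or brackets."""
    fields, field, stack, quote = [], [], [], None
    i = 0
    while i < len(s):
        ch = s[i]
        if quote:
            # Inside a string literal: copy characters through unchanged.
            if ch == '\\' and i + 1 < len(s):
                field.append(ch)      # keep the backslash
                i += 1
                ch = s[i]             # and the escaped character after it
            elif ch == quote:
                quote = None          # closing quote ends the literal
        elif ch in '"\'':
            quote = ch
        elif ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != PAIRS[ch]:
                raise ValueError(f"unbalanced {ch!r} at index {i}")
        elif ch == sep and not stack:
            # Top-level separator: finish the current field.
            fields.append(''.join(field).strip())
            field = []
            i += 1
            continue
        field.append(ch)
        i += 1
    if quote or stack:
        raise ValueError("unterminated quote or bracket")
    fields.append(''.join(field).strip())
    return fields

sample = ("createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, "
          "text='Test Test', source=\"Region\", entities=[foo, bar], "
          "user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}")
for chunk in split_top_level(sample):
    print(chunk)
```

Unlike the depth counter, this version raises on input such as a=(1] because the closing bracket does not match the opener on top of the stack.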

Upvotes: 2

Jay

Reputation: 19857

Instead of splitting on the comma, you can use the following regular expression to match the chunks that you want.

(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)

Python:

import re
text = "createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source=\"Region\", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}"
re.findall(r'(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)', text)

>> [
    ('createdAt', 'Fri Aug 24 09:48:51 EDT 2012'), 
    ('id', '238996293417062401'), 
    ('text', "'Test Test'"), 
    ('source', '"Region"'), 
    ('entities', '[foo, bar]'), 
    ('user', '{name=test, locations=[loc1,loc2], locations={comp1, comp2}}')
   ]

I've set up grouping so it will separate out the "key" and the "value". It will do the same in Java - See it working in Java here:

http://www.regexplanet.com/cookbook/ahJzfnJlZ2V4cGxhbmV0LWhyZHNyDgsSBlJlY2lwZRj0jzQM/index.html

Regular Expression explained:

  • (?:^| ) Non-capturing group that matches the beginning of the input (or of a line, in multiline mode), or a space
  • (.+?) Matches the "key" before the...
  • = equal sign
  • (\{.+?\}|\[.+?\]|.+?) Matches either a set of {characters}, [characters], or finally just characters
  • (?=,|$) Look ahead that matches either a , or the end of the input (or of a line, in multiline mode).
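One caveat seems worth checking before relying on this pattern. The lazy {characters} alternative stops at the first closing brace that is followed by a comma, so a nested value that is not the last field can be cut short. The input below is a made-up example (it is not from the question), chosen only to show that behavior:

```python
import re

# The pattern from the answer above, unchanged.
pair_re = re.compile(r'(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)')

# Hypothetical input: a nested brace group that is NOT the last field.
nested = "user={a={x}, b=2}, id=1"
pairs = pair_re.findall(nested)
print(pairs)
# [('user', '{a={x}'), ('b', '2}'), ('id', '1')]
```

The sample string in the question avoids this because its only nested group comes last, where the look-ahead can only succeed at the end of the input. For arbitrary nesting, tracking depth explicitly is likely the safer route.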

Upvotes: 1
