Reputation: 65

Parsing a string with nested quotes

I need to parse a string that looks like this:

"prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"

And I would like to get a list like the following:

['field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10']

I've tried it with regular expressions, but it doesn't work in the substring of the pseudo-SQL statement.

How can I get that list?

Upvotes: 1

Answers (5)

Alain T.

Reputation: 42143

The comma separators that are actually between fields will be at an even quote level. So, by changing these commas to \n characters, you could apply a simple .split("\n") on the string to get field values. You then only need to clean up the field values to remove leading/trailing spaces and quotes.

from itertools import accumulate

string      = "prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"
prefix,data = string.split(" ",1)                   # remove prefix
quoteLevels = accumulate( c == "'" for c in data )  # compute quote levels for each character
fieldData   = "".join([ "\n" if c=="," and q%2 == 0 else c for c,q in zip(data,quoteLevels) ]) # comma to /n at even quote levels
fields      = [ f.strip().strip("'") for f in fieldData.split("'\n '") ] # split and clean content

for i,field in enumerate(fields): print(i,field)

This will print:

0 field1
1 
2 field2
3 field3
4 select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 
5 field5
6 field6
7 field7
8 field8
9 field9
10 
11 field10

Upvotes: 0

scratchpad

Reputation: 286

Here's a brittle way to do it if you know what the SQL string is supposed to look like.

We match the SQL string, and split the rest into start and end strings.

Then we match the simpler field pattern and build up a list from start for that pattern, add back in the SQL match, and then the fields from the end string.

sqlmatch = 'select .* LIMIT 0'
fieldmatch = "'(|\w+)'"
match = re.search(sqlmatch, mystring)
startstring = mystring[:match.start()]
sql = mystring[match.start():match.end()]
endstring = mystring[match.end():]
result = []
for found in re.findall(fieldmatch, startstring):
    result.append(found)

result.append(sql)
for found in re.findall(fieldmatch, endstring):
    result.append(found)

Then the result list looks like the following:

['field1',
 '',
 'field2',
 'field3',
 'select ... where (column1 = \'2017\') and (((\'literal1\', \'literal2\', \'literal3\', \'literal4\', \'literal5\', \'literal6\', \'literal7\') OVERLAPS column2 Or (\'literal8\') 
OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = \'literal9\')  LIMIT 0',
 'field5',
 'field6',
 'field7',
 'field8',
 'field9',
 '',
 'field10']

Upvotes: 2

ekhumoro

Reputation: 120598

Since the number of fields is fixed and the non-sql fields do not have embedded quotes, there is a simple, three-line solution:

prefix, other = string.partition(' ')[::2]
fields = other.strip('\'').split('\', \'')
fields[4:-7] = [''.join(fields[4:-7])]

print(fields)

Output:

['field1', '', 'field2', 'field3', "select ... where (column1 = '2017') and ((('literal1literal2literal3literal4literal5literal6literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ", 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10']

Upvotes: 1

solarc

Reputation: 5738

If the number of fields is constant, you could do something like this:

def splitter(string):
    strip_chars = "\"' "
    string = string[len('prefix '):] # remove the prefix
    left_parts = string.split(',', 4) # only split up to 4 times
    for i in left_parts[:-1]:
        yield i.strip(strip_chars) # return what we've found so far
    right_parts = left_parts[-1].rsplit(',', 7) # only split up to 7 times starting from the right
    for i in right_parts:
        yield i.strip(strip_chars) # return the rest

mystr = """prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"""
result = list(splitter(mystr))
print(repr(result))


# result:
[
    'field1',
    '',
    'field2',
    'field3',
    'select ... where (column1 = \'2017\') and (((\'literal1\', \'literal2\', \'literal3\', \'literal4\', \'literal5\', \'literal6\', \'literal7\') OVERLAPS column2 Or (\'literal8\') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = \'literal9\')  LIMIT 0',
    'field5',
    'field6',
    'field7',
    'field8',
    'field9',
    '',
    'field10'
]

Upvotes: 0

E.Serra

Reputation: 1574

Someone pointed out your string is malformed, I used this:

mystr = "prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"

found = [a.replace("'", '').replace(',', '') for a in mystr.split(' ') if "'" in a]

Which returns:

['field1',
 '',
 'field2',
 'field3',
 'select',
 '2017)',
 '(((literal1',
 'literal2',
 'literal3',
 'literal4',
 'literal5',
 'literal6',
 'literal7)',
 '(literal8)',
 'literal9)',
 '',
 'field5',
 'field6',
 'field7',
 'field8',
 'field9',
 '',
 'field10']

Upvotes: 0

Parsing a string with nested quotes

Answers (5)

Related Questions