Reputation: 65
I need to parse a string that looks like this:
"prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"
And I would like to get a list like the following:
['field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10']
I've tried it with regular expressions, but it doesn't work in the substring of the pseudo-SQL statement.
How can I get that list?
Upvotes: 1
Views: 783
Reputation: 42143
The comma separators that are actually between fields will be at an even quote level. So, by changing these commas to \n characters, you could apply a simple .split("\n") on the string to get field values. You then only need to clean up the field values to remove leading/trailing spaces and quotes.
from itertools import accumulate
string = "prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"
prefix,data = string.split(" ",1) # remove prefix
quoteLevels = accumulate( c == "'" for c in data ) # compute quote levels for each character
fieldData = "".join([ "\n" if c=="," and q%2 == 0 else c for c,q in zip(data,quoteLevels) ]) # comma to /n at even quote levels
fields = [ f.strip().strip("'") for f in fieldData.split("'\n '") ] # split and clean content
for i,field in enumerate(fields): print(i,field)
This will print:
0 field1
1
2 field2
3 field3
4 select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0
5 field5
6 field6
7 field7
8 field8
9 field9
10
11 field10
Upvotes: 0
Reputation: 286
Here's a brittle way to do it if you know what the SQL string is supposed to look like.
We match the SQL string, and split the rest into start and end strings.
Then we match the simpler field pattern and build up a list from start for that pattern, add back in the SQL match, and then the fields from the end string.
sqlmatch = 'select .* LIMIT 0'
fieldmatch = "'(|\w+)'"
match = re.search(sqlmatch, mystring)
startstring = mystring[:match.start()]
sql = mystring[match.start():match.end()]
endstring = mystring[match.end():]
result = []
for found in re.findall(fieldmatch, startstring):
result.append(found)
result.append(sql)
for found in re.findall(fieldmatch, endstring):
result.append(found)
Then the result list looks like the following:
['field1',
'',
'field2',
'field3',
'select ... where (column1 = \'2017\') and (((\'literal1\', \'literal2\', \'literal3\', \'literal4\', \'literal5\', \'literal6\', \'literal7\') OVERLAPS column2 Or (\'literal8\')
OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = \'literal9\') LIMIT 0',
'field5',
'field6',
'field7',
'field8',
'field9',
'',
'field10']
Upvotes: 2
Reputation: 120598
Since the number of fields is fixed and the non-sql fields do not have embedded quotes, there is a simple, three-line solution:
prefix, other = string.partition(' ')[::2]
fields = other.strip('\'').split('\', \'')
fields[4:-7] = [''.join(fields[4:-7])]
print(fields)
Output:
['field1', '', 'field2', 'field3', "select ... where (column1 = '2017') and ((('literal1literal2literal3literal4literal5literal6literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ", 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10']
Upvotes: 1
Reputation: 5738
If the number of fields is constant, you could do something like this:
def splitter(string):
strip_chars = "\"' "
string = string[len('prefix '):] # remove the prefix
left_parts = string.split(',', 4) # only split up to 4 times
for i in left_parts[:-1]:
yield i.strip(strip_chars) # return what we've found so far
right_parts = left_parts[-1].rsplit(',', 7) # only split up to 7 times starting from the right
for i in right_parts:
yield i.strip(strip_chars) # return the rest
mystr = """prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"""
result = list(splitter(mystr))
print(repr(result))
# result:
[
'field1',
'',
'field2',
'field3',
'select ... where (column1 = \'2017\') and (((\'literal1\', \'literal2\', \'literal3\', \'literal4\', \'literal5\', \'literal6\', \'literal7\') OVERLAPS column2 Or (\'literal8\') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = \'literal9\') LIMIT 0',
'field5',
'field6',
'field7',
'field8',
'field9',
'',
'field10'
]
Upvotes: 0
Reputation: 1574
Someone pointed out your string is malformed, I used this:
mystr = "prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9') LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"
found = [a.replace("'", '').replace(',', '') for a in mystr.split(' ') if "'" in a]
Which returns:
['field1',
'',
'field2',
'field3',
'select',
'2017)',
'(((literal1',
'literal2',
'literal3',
'literal4',
'literal5',
'literal6',
'literal7)',
'(literal8)',
'literal9)',
'',
'field5',
'field6',
'field7',
'field8',
'field9',
'',
'field10']
Upvotes: 0