Reputation: 12264
I need to parse some information from a SQL DDL statement using regex. The SQL statement looks like this:
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)
I need to parse it in Python and pull out the columns named in the PARTITIONED BY
clause. I've figured out a regex to achieve it after removing newline characters but I can't get it to work if there are newline chars in there. here is some demo code:
import re
def print_partition_columns_if_found(ddl_string):
regex = r'CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\).*?USING +([^\s]+)( +OPTIONS *\([^)]+\))?( *PARTITIONED BY \((?P<pcol>.*?)\))?'
match = re.search(regex, ddl_string, re.MULTILINE | re.DOTALL)
if match.group("pcol"):
print match.group("pcol").strip()
else:
print 'did not find any pcols in {matches}'.format(matches=match.groups())
ddl_string1 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet OPTIONS (serialization.format '1') PARTITIONED BY (DATA2, DATA3)"""
print_partition_columns_if_found(ddl_string1)
print "--------"
ddl_string2 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)
"""
print_partition_columns_if_found(ddl_string2)
That returns:
DATA2, DATA3
--------
did not find any pcols in (None, 'default.', 'table1 ', 'DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT', 'parquet', None, None, None)
Any regex experts care to help me out?
Upvotes: 1
Views: 2527
Reputation: 3886
Lets check python sqlparse documentation: Documentation - getting started
>>> import sqlparse
>>> ddl_string2 = """
... CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
... USING parquet
... OPTIONS (
... serialization.format '1'
... )
... PARTITIONED BY (DATA2, DATA3)
... """
>>> ddl_string1 = """
... CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
... USING parquet OPTIONS (serialization.format '1') PARTITIONED BY (DATA2, DATA3)"""
>>> def print_partition_columns_if_found(sql):
... parse = sqlparse.parse(sql)
... data = next(item for item in reversed(parse[0].tokens) if item.ttype is None)[1]
... print(data)
...
>>> print_partition_columns_if_found(ddl_string1)
DATA2, DATA3
>>> print_partition_columns_if_found(ddl_string2)
DATA2, DATA3
>>>
Upvotes: 3