How to extract the column names from a sql query

Question

I have extracted just the column fields from a query like this

query_split = [query[query.find("select") + len("select"): query.find("from")]]

I get a string like this

 query_split = [' service,count(*) as count,round(sum(mrp),2) as sale ']

I want to get a list which looks like this

[' service','count(*) as count','round(sum(mrp),2) as sale']

This is because I want to get the list of column names

['service','count','sale']

I have tried other methods such as

for file in reader(query_split):
    print(file)

Gives me the output

[' service', 'count(*) as count', 'round(sum(mrp)', '2) as sale ']

when i took the test case which uses round(sum(mrp),2) type operation in query the below function failed at that point

def get_column_name(query):
    """
    Extracts the column name from a sql query
    :param query: str
    :return: column_name
    list: Column names which that query will fetch
    """
    column_name=[]
    query_split = query[query.find("select") + len("select"): query.find("from")]
    for i in query_split.split(','):

        if "as" in i:
            column_name.append(i.split('as')[-1])
        else:
            column_name.append(i.split(' ')[-1])
    return column_name

wp78de · Accepted Answer

Your problem is that the SQL at play here features nested constructs.

The most likely cleanest solution is to have a SQL parser that understands the MySQL dialect. Arguably, it can be done most easily with ANTLR; you can find a MySQL grammar here and a quick guide here if you are curious.

To approach this with regex we need to account for balanced parenthesis with a recursive regex in a match pattern like this:

[^,]+($(?>[^()]++|(?1))*+$)[^,]+|([^(),]+(?:,|$))

Explanation:

[^,]+($(?>[^()]++|(?1))*+$)[^,]+ the recursive regex to match pairs of () and everything inbetween (including commas) sourounded by a negated character class that matches everything but a comma.
([^(),]+(?:,|$)) matches regular columns

Demo

Sample Code:

import regex as re
regex = r"[^,]+($(?>[^()]++|(?1))*+$)[^,]+|([^(),]+(?:,|$))"
test_str = "service,count(*) as count,round(sum(mrp),2) as sale,count(*) as count2,round(sum(mrp),2) as sale2"
matches = re.finditer(regex, test_str, re.MULTILINE)
result = [match.group() for match in matches]

Outputs:

['service,', 'count(*) as count', 'round(sum(mrp),2) as sale', 'count(*) as count2', 'round(sum(mrp),2) as sale2']

Since we are using PCRE regex features you will need to install Python's alternative regex package to run the code. Good luck.

How to extract the column names from a sql query

Answers (1)

Related Questions