Bogdan Emil Mariesan
Bogdan Emil Mariesan

Reputation: 5647

Select clause complex regex pattern

I am working on some application for my masters thesis and in the process I have to build a SQL Parser. To do so I've decide to go heavy on regexing since it seems the best way at the time.

The problem is that i have some minor problems with my regexes.

Considering some query examples such as:

select 
    RIC 
from 
    (select 
         s.RIC, m.NAME 
     from 
         Stock s, Market m 
     where 
         s.LISTED_ON_EXCHANGE = m.RIC) t 
where 
    RIC > 'G';

select * 
from Stock 
order by COMPANY 
LIMIT 0,2;

select 1+2;

select now();

select 
    s.RIC, m.NAME 
from 
    Stock s 
INNER JOIN  
    Market ON m I s.LISTED_ON_EXCHANGE = m.RIC;

select * 
from Stock 
order by COMPANY;

select * 
from Stock 
where RIC in ('GS.N' , 'INFY.BO');

select * 
from Stock 
where RIC LIKE 'V%';

select * 
from Stock 
where RIC BETWEEN 'G' AND 'I';

select count(*) 
from STOCK 
where LISTED_ON_EXCHANGE IS NOT NULL;

select na_me as n, price as p 
from bla, blabla, blalalaa;

And given the following two regexes:

SELECT_FIELDS_PATTERN = "(?<=[SELECT]) [\\d\\w',.*() ]+ (?=FROM)";

That should match selection fields.

And:

SELECT_FROM_PATTERN = "(?<=[FROM]) [\\w, ]+ (?(?=(?:WHERE|INNER|ORDER)))";

That should match FROM clauses excluding any conditions or ordering etc.

All of the queries except

select 1+2;
select now();

Should be valid. That's because I only want to parse select queries that contain relevant information for me.

The problem is that the two regexes I've created won't validate for example the last query:

select na_me as n, price as p from bla, blabla, blalalaa;

So I would require some help to improve my regexing for select queries, maybe even merge the two regexes?

An example of a correct output for the first query:

select RIC from (select s.RIC, m.NAME from Stock s, Market m where s.LISTED_ON_EXCHANGE=m.RIC) t where RIC > 'G';

The output should be:

RIC

for the first part and

(select s.RIC, m.NAME from Stock s, Market m where s.LISTED_ON_EXCHANGE=m.RIC) t

for the second part

Upvotes: 0

Views: 1156

Answers (1)

Qtax
Qtax

Reputation: 33908

  • Character classes are not groups -- remove [ & ] around keywords.
  • Don't use useless lookarounds, it can lead to problems in some cases.
  • You probably want to use \b around keywords so that SELECT does not match in FOOSELECT.
  • Can use (?i) to make the expression case insensitive.

You could use something like:

(?i)\bSELECT\b\s+(.+)\s+\bFROM\b\s+([\w\s,]+?)(?:\s+\b(?:WHERE|INNER|ORDER)\b|;?$)

With the parts of interest being captured in the first and second capturing group.

Note this will not work right with strings and in other cases, also SQL is recursive, which is pretty difficult to parse with Java regex. I suggest you use a proper parser if you want to parse SQL properly. (You can write a simple one your self, using regex for lexing generating tokens and Java to parse the tokens and build a parse tree.)

Upvotes: 2

Related Questions