Reputation: 5647
I am working on some application for my masters thesis and in the process I have to build a SQL Parser. To do so I've decide to go heavy on regexing since it seems the best way at the time.
The problem is that i have some minor problems with my regexes.
Considering some query examples such as:
select
RIC
from
(select
s.RIC, m.NAME
from
Stock s, Market m
where
s.LISTED_ON_EXCHANGE = m.RIC) t
where
RIC > 'G';
select *
from Stock
order by COMPANY
LIMIT 0,2;
select 1+2;
select now();
select
s.RIC, m.NAME
from
Stock s
INNER JOIN
Market ON m I s.LISTED_ON_EXCHANGE = m.RIC;
select *
from Stock
order by COMPANY;
select *
from Stock
where RIC in ('GS.N' , 'INFY.BO');
select *
from Stock
where RIC LIKE 'V%';
select *
from Stock
where RIC BETWEEN 'G' AND 'I';
select count(*)
from STOCK
where LISTED_ON_EXCHANGE IS NOT NULL;
select na_me as n, price as p
from bla, blabla, blalalaa;
And given the following two regexes:
SELECT_FIELDS_PATTERN = "(?<=[SELECT]) [\\d\\w',.*() ]+ (?=FROM)";
That should match selection fields.
And:
SELECT_FROM_PATTERN = "(?<=[FROM]) [\\w, ]+ (?(?=(?:WHERE|INNER|ORDER)))";
That should match FROM clauses excluding any conditions or ordering etc.
All of the queries except
select 1+2;
select now();
Should be valid. That's because I only want to parse select queries that contain relevant information for me.
The problem is that the two regexes I've created won't validate for example the last query:
select na_me as n, price as p from bla, blabla, blalalaa;
So I would require some help to improve my regexing for select queries, maybe even merge the two regexes?
An example of a correct output for the first query:
select RIC from (select s.RIC, m.NAME from Stock s, Market m where s.LISTED_ON_EXCHANGE=m.RIC) t where RIC > 'G';
The output should be:
RIC
for the first part and
(select s.RIC, m.NAME from Stock s, Market m where s.LISTED_ON_EXCHANGE=m.RIC) t
for the second part
Upvotes: 0
Views: 1156
Reputation: 33908
[
& ]
around keywords.\b
around keywords so that SELECT
does not match in FOOSELECT
.(?i)
to make the expression case insensitive. You could use something like:
(?i)\bSELECT\b\s+(.+)\s+\bFROM\b\s+([\w\s,]+?)(?:\s+\b(?:WHERE|INNER|ORDER)\b|;?$)
With the parts of interest being captured in the first and second capturing group.
Note this will not work right with strings and in other cases, also SQL is recursive, which is pretty difficult to parse with Java regex. I suggest you use a proper parser if you want to parse SQL properly. (You can write a simple one your self, using regex for lexing generating tokens and Java to parse the tokens and build a parse tree.)
Upvotes: 2