Shlomi Hassid
Shlomi Hassid

Reputation: 6606

Regex: extracting valid table names and columns from an MySQL query

I'm trying to match and extract all the table names and columns from any given MySQL query.

The given query is unquoted (back ticks) and According to MySQL the naming rules are:

Permitted characters in unquoted identifiers:

ASCII: [0-9,a-z,A-Z$_] (basic Latin letters, digits 0-9, dollar, underscore)

Extended: U+0080 .. U+FFFF

For a test case I'm using this query:

SELECT  users.id , users.first_name ,users.last_name,  roles.role,avatars.img_name,timezone.gmt_offset
FROM  users 
LEFT JOIN roles ON  users.role = roles.id 
LEFT JOIN  avatars ON users.avatar=avatars.id 
LEFT JOIN country ON users.country=country.country_code 
LEFT JOIN  timezone ON users.timezone = timezone.id 
WHERE (users.id >=2 AND  users.id <=4 ) OR (roles.role LIKE  'us%')
OR (roles.role =  'user(complex.sit )' && (timezone.gmt_offset >=7200
OR  users.last_name ='tryme'))
LIMIT 0 , 30

My Regex so far:

%[ .(),]?([a-z0-9_$]{2,})[ .(),]?(?!AND|OR|LIKE|SELECT|JOIN|ON)%i

I'm planning on capturing the group and replacing it with the match wrap with backticks The problem is That I cant filter out the reserved words that are being matched too (SELECT, JOIN....), I have tried adding a negative lookahead but it doesn't work.

The second problem is with values like in the example = 'user(complex.sit )' i dont want it to match those two words (complex sit).

Any suggestions?

Upvotes: 0

Views: 1285

Answers (2)

nitishagar
nitishagar

Reputation: 9413

Use the %g global modifier with the expression:

%[\s.(),]+?([a-z\d_$]{2,})[\s.(),]*?(?:AND|OR|LIKE|SELECT|JOIN|ON|)%g

Upvotes: 0

user663031
user663031

Reputation:

Yet another question of the form, "How do I manipulate some little program in language XXX using Regexps (in host language YYY)?" The underlying question is, "How do I parse a program in language XXX using regexps?" The correct answer is almost always, DON'T.

You are embarking on a path of tears. Regexp is not designed to parse any but the most of trivial, limited of languages. You may find a regexp which meets your immediate needs, but then when another requirement comes in, you hit a brick wall. The regexp(s) grow longer and longer and become impossible to understand, much less maintain.

To parse a language, use a parser. At this point in time, it's not really an exaggeration to say that there are parsers available for virtually all languages for virtually all platforms.

I don't know what language/platform you are working in, so I won't suggest any specific parsers, but a query for "JavaScript SQL parser" brought up this thing right away: https://www.npmjs.com/package/simple-sql-parser. Just meant as an example.

Upvotes: 1

Related Questions