Reputation: 267
I want to extract dot-separated names (words joined by a dot) from a very long SQL string. The string may be a single line or several lines separated by newlines, and the names may appear on the first line, on a later line, or both.
I need help writing a regex for this.
Examples:
String s = "I am testing something like test.test.test in sentence." Expected output: test.test.test
Example 2 (real use case):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename. Can I also require a prefix or suffix word (or a space) that must be present wherever this logic is applied? In the case above, if the keyword is UPDATE the regex should pick up test.table, but if the statement were something like select test.table, the regex should not pick it up. The same applies to suffixes.
Example 3 (illustrating the above):
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected output: test.table, test.table (the keywords are INTO and FROM)
Not wanted in the selection: as_t.numb, as_t.old, as_t.date
Once I have the regex, I can use it in my program to extract these names.
Note: The words before and after the combination could be anything, such as update, select, { or (, so we have to find every occurrence of words joined together with a dot (.).
I tried something like this:
(?<=.)(.?)(?=.)(.?) : this selected only the word between two dots, not the whole combination.
.(?<=.)(.?)(?=.)(.?). : this selected everything before and after.
Upvotes: 1
Views: 2685
Reputation: 2342
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] creates a character class that matches everything except what is between the brackets. In this case I put \s in there, which matches any whitespace, so [^\s] matches anything that isn't whitespace.
+ is a quantifier. It means to match as many of the preceding construct as possible without breaking the match. On its own, [^\s]+ would happily match any run of non-whitespace characters, but I follow it with \., which matches a literal dot. The \ is necessary because an unescaped . matches any character in regex, so we need to escape it to give it only its literal meaning. This means there has to be a . somewhere in the group of non-whitespace characters.
The pattern ends with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want this match to succeed only if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case, called a lookbehind. The syntax is (?<= ... ), where ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify with \s that it should be followed by a whitespace character. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by passing the case-insensitive flag i to your regex engine. If your regex engine doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
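As a rough check of the two patterns above, here is a minimal Python sketch run against Example 3 from the question. The choice of Python and the names SQL, BASIC, AFTER_KEYWORD and CAPTURED are assumptions for illustration, and the capturing-group variant with UPDATE is not part of the answer above; it is only one possible workaround for engines that restrict lookbehinds.

```python
import re

# Example 3 from the question.
SQL = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# Any run of non-whitespace characters that contains a dot.
BASIC = re.compile(r"[^\s]+\.[^\s]+")

# Same pattern, but only directly after INTO or FROM plus one whitespace
# character. re.IGNORECASE plays the role of the i flag / inline (?i).
AFTER_KEYWORD = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

# Hypothetical variant: Python's re only accepts fixed-width lookbehinds,
# so keywords of different lengths (INTO vs. UPDATE) are handled here with
# a capturing group instead of a lookbehind.
CAPTURED = re.compile(r"\b(?:INTO|FROM|UPDATE)\s+([^\s]+\.[^\s]+)", re.IGNORECASE)

basic_matches = BASIC.findall(SQL)
keyword_matches = AFTER_KEYWORD.findall(SQL)
captured_matches = CAPTURED.findall("UPDATE test.table SET access = 01")

print(basic_matches)
# ['test.table', 'wu_id.Item_Nbr', 'test.table', 'as_t.old', 'as_t.date', '(as_t.numb']
print(keyword_matches)   # ['test.table', 'test.table']
print(captured_matches)  # ['test.table']
```

Note that the unrestricted pattern also picks up the opening parenthesis in (as_t.numb, because ( is not whitespace. The lookbehind version also expects exactly one whitespace character between the keyword and the name, so a keyword followed by several spaces would be missed.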
Upvotes: 1
Reputation: 365
If you are new to regex, I highly recommend using the www.regex101.com website to build a regex and learn how it works. Don't forget to check out the code generator, which produces the regex code for the programming language you are using. That's a cool feature.
For your question, you need a regex that matches any word character (\w) zero or more times, followed by a dot, followed by another run of word characters repeated zero or more times.
So here is my solution to your question:
Your regex in JavaScript: const regex = /([\w]*[.][\w]*)+/gm;
in Java: final String regex = "([\\w]*[.][\\w]*)+";
in Python: regex = r"([\w]*[.][\w]*)+"
in PHP: $re = '/([\w]*[.][\w]*)+/m';
Note: this solution is tailored to your use case (SQL strings). It will also match something like '.word' or 'word..word', which I assume won't appear in your input.
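To make that caveat concrete, here is a small Python sketch comparing a pattern that follows the description above (word characters zero or more times around each dot) with a stricter variant. The stricter pattern and the names TEXT, LOOSE and STRICT are assumptions added for illustration, not part of the original solution.

```python
import re

# Example 1 from the question.
TEXT = "I am testing something like test.test.test in sentence."

# Zero or more word characters on either side of each dot, as described above.
# A non-capturing group is used so findall returns the whole match.
LOOSE = re.compile(r"(?:\w*\.\w*)+")

# Hypothetical stricter variant: at least one word character is required
# on both sides of every dot.
STRICT = re.compile(r"\w+(?:\.\w+)+")

loose_matches = LOOSE.findall(TEXT)
strict_matches = STRICT.findall(TEXT)

print(loose_matches)   # ['test.test.test', 'sentence.']
print(strict_matches)  # ['test.test.test']
```

The loose form also matches the trailing "sentence." because nothing is required after the dot. If such input can occur, requiring one or more word characters on each side avoids it.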
See this screenshot for more details
Upvotes: 0