Reputation: 36050
How can I search for keywords that are not inside a string?
For example, if I have the text:
Hello this text is an example.
bla bla bla "this text is inside a string"
"random string" more text bla bla bla "foo"
I would like to match every occurrence of the word text that is not inside " ". In other words, I would like to match the text on the first line and the text after "random string" on the third line.
Note that I do not want to match the text on the second line, because it is inside a string.
Possible solution:
I have been working on it and this is what I have so far:
(?s)((?<q>")|text)(?(q).*?"|)
Note that the regex uses a conditional of the form (?(predicate)true alternative|false alternative),
so the regex reads:
find " or text. If you found ", keep matching until you find " again (.*?"); if you found text, do nothing more.
When I run that regex I match the whole string, though. I am asking this question in order to learn. I know I could remove all strings and then look for what I need.
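One way to keep the conditional idea is to let the pattern consume quoted strings as it does now and then discard every match in which the group q took part. The following is only a sketch in Python's re module, not the .NET engine used in the question; Python spells the named group (?P<q>...), and the helper name keyword_spans is hypothetical.

```python
import re

SAMPLE = (
    'Hello this text is an example.\n'
    'bla bla bla "this text is inside a string"\n'
    '"random string" more text bla bla bla "foo"'
)

# Python spelling of the pattern from the question: (?P<q>...) instead of (?<q>...).
PATTERN = re.compile(r'(?s)(?:(?P<q>")|text)(?(q).*?"|)')

def keyword_spans(s):
    """Return (start, end) of matches in which the quote group q did not participate."""
    return [m.span() for m in PATTERN.finditer(s) if m.group('q') is None]

spans = keyword_spans(SAMPLE)
print([SAMPLE[a:b] for a, b in spans])
```

The quoted strings are still matched, but because each one is consumed as a whole, the text inside it is never offered as a match of its own.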
Upvotes: 19
Views: 14907
Reputation: 43646
I have used these answers many times and want to share an alternative approach, since I was sometimes unable to implement the given answers.
Instead of matching keywords that are outside of something, break the task into two subtasks: first remove the text you do not want to search (here, the quoted strings), then run an ordinary match on what is left.
For example, to remove the text in quotes I use:
[dbo].[fn_Utils_RegexReplace] ([TSQLRepresentation_WHERE], '''.*?(?<!\\)''', '')
or, more clearly, the pattern is '.*?(?<!\\)'.
I know that this may look like double work and may have a performance impact on some platforms/languages, so everyone needs to test this, too.
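The two-step idea can be sketched in Python as follows. This is an illustration under assumptions, not the T-SQL function above: it targets double-quoted strings as in the question, the helper name find_outside_quotes is hypothetical, and it blanks each string with spaces rather than replacing it with an empty string so that the reported offsets still refer to the original text.

```python
import re

# One double-quoted string, allowing backslash escapes inside it.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

def find_outside_quotes(s, keyword):
    """Step 1: blank out quoted strings. Step 2: plain search on what is left."""
    # Padding with spaces (an assumption of this sketch) keeps offsets aligned with s.
    blanked = QUOTED.sub(lambda m: ' ' * len(m.group()), s)
    return [m.start() for m in re.finditer(re.escape(keyword), blanked)]

SAMPLE = '"random string" more text bla "foo text"'
hits = find_outside_quotes(SAMPLE, 'text')
print(hits)
```

If only a yes/no answer is needed, replacing with an empty string as in the answer works just as well.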
Upvotes: 0
Reputation: 30580
Here is one answer:
(?<=^([^"]|"[^"]*")*)text
This means:
(?<= # preceded by...
^ # the start of the string, then
([^"] # either not a quote character
|"[^"]*" # or a full string
)* # as many times as you want
)
text # then the text
You can easily extend this to handle strings containing escapes as well.
In C# code:
Regex.Match("bla bla bla \"this text is inside a string\"",
"(?<=^([^\"]|\"[^\"]*\")*)text", RegexOptions.ExplicitCapture);
Added from comment discussion: an extended version that matches on a per-line basis and handles escapes. Use RegexOptions.Multiline for this:
(?<=^([^"\r\n]|"([^"\\\r\n]|\\.)*")*)text
In a C# string this looks like:
"(?<=^([^\"\r\n]|\"([^\"\\\\\r\n]|\\\\.)*\")*)text"
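For readers without a .NET engine: Python's re module does not accept variable-length lookbehind, so the pattern above cannot be used there unchanged. The sketch below emulates it instead. It finds every occurrence of the keyword and keeps those whose line prefix fully matches the same "unquoted characters or complete strings" grammar. The function name outside_string_offsets is hypothetical.

```python
import re

# Prefix grammar taken from the lookbehind: unquoted characters or complete
# quoted strings (with backslash escapes), never crossing a line break.
PREFIX = re.compile(r'(?:[^"\r\n]|"(?:[^"\\\r\n]|\\.)*")*')

def outside_string_offsets(s, keyword='text'):
    """Offsets of keyword whose line prefix contains only complete strings."""
    offsets = []
    for m in re.finditer(re.escape(keyword), s):
        # Start of the current line (0 when there is no earlier line break).
        line_start = max(s.rfind('\n', 0, m.start()),
                         s.rfind('\r', 0, m.start())) + 1
        # Emulates (?<=^(...)*): the whole prefix must match the grammar.
        if PREFIX.fullmatch(s, line_start, m.start()):
            offsets.append(m.start())
    return offsets

SAMPLE = (
    'Hello this text is an example.\n'
    'bla bla bla "this text is inside a string"\n'
    '"random string" more text bla bla bla "foo"'
)
print(outside_string_offsets(SAMPLE))
```

An occurrence inside a string is rejected because its prefix ends in an unterminated quote, which the grammar cannot match.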
Since you now want to use ** instead of " as the delimiter, here is a version for that:
(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text
Explanation:
(?<= # preceded by
^ # start of line
( # either
[^*\r\n]| # not a star or line break
\*(?!\*)| # or a single star (star not followed by another star)
\*\* # or 2 stars, followed by...
([^*\\\r\n] # either: not a star or a backslash or a linebreak
|\\. # or an escaped char
|\*(?!\*) # or a single star
)* # as many times as you want
\*\* # ended with 2 stars
)* # as many times as you want
)
text # then the text
Since this version doesn't contain " characters, it's cleaner to use a verbatim string literal:
@"(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text"
Upvotes: 26
Reputation: 7912
I would simply match the quoted strings first, without capturing them, to filter them out, and then use a capturing group for the unquoted keyword, like this:
"[^"]*"|(text)
A greedy ".*text.*" would run from the first quote on a line to the last one and swallow any unquoted text in between, hence the [^"]*. You might want to refine this a little with word boundaries etc., but it should get you where you want to go and serves as a clear, readable sample.
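A minimal Python sketch of this alternation approach, assuming strings without escaped quotes: the first alternative consumes each quoted string as a whole using [^"]*, and only matches in which group 1 is set are kept. The helper name unquoted_hits is hypothetical.

```python
import re

# Quoted strings are consumed by the first alternative; group 1 is set only
# when the keyword is matched outside of them.
PATTERN = re.compile(r'"[^"]*"|(text)')

def unquoted_hits(s):
    """Start offsets of matches where the capturing group participated."""
    return [m.start(1) for m in PATTERN.finditer(s) if m.group(1) is not None]

SAMPLE = (
    'Hello this text is an example.\n'
    'bla bla bla "this text is inside a string"\n'
    '"random string" more text bla bla bla "foo"'
)
print(unquoted_hits(SAMPLE))
```

The caller has to check the group on every match, since matches of whole quoted strings are returned as well with the group unset.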
Upvotes: 1
Reputation: 208565
This can get pretty tricky, but here is one potential method that works by making sure that there is an even number of quotation marks between the matching text and the end of the string:
text(?=[^"]*(?:"[^"]*"[^"]*)*$)
Replace text with the regex that you want to match.
Rubular: http://www.rubular.com/r/cut5SeWxyK
Explanation:
text # match the literal characters 'text'
(?= # start lookahead
[^"]* # match any number of non-quote characters
(?: # start non-capturing group, repeated zero or more times
"[^"]*" # one quoted portion of text
[^"]* # any number of non-quote characters
)* # end non-capturing group
$ # match end of the string
) # end lookahead
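The lookahead uses only constructs that Python's re module also supports, so it can be tried there as a quick check. This sketch assumes the quotes in the input are balanced and contain no escaped quotes; the sample text is the one from the question.

```python
import re

# Keyword followed by an even number of quotes up to the end of the string.
PATTERN = re.compile(r'text(?=[^"]*(?:"[^"]*"[^"]*)*$)')

SAMPLE = (
    'Hello this text is an example.\n'
    'bla bla bla "this text is inside a string"\n'
    '"random string" more text bla bla bla "foo"'
)

hits = [m.start() for m in PATTERN.finditer(SAMPLE)]
print(hits)
```

Because [^"]* also matches line breaks, the quotes are counted across the rest of the whole input, not just the current line; a single unbalanced quote anywhere after a keyword changes the result for that keyword.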
Upvotes: 8