LeO

Reputation: 5238

Java: RegExp for matching words between a quote

I have the following test string

This is my "te

st" case
with lines for "tes"t"ing" with regex
But as he said "It could be an arbitrary number of words"

I want to match everything between double quotes (") as long as the quotes are bound to words. I have the following regexp:

\"([^\"]*)\"

which matches the quoted "test" quite well, even though it is split across lines. Is there a way to match tes"t"ing as one whole word as well (and not split it into two matches)? Adding the word boundaries \b (\b\"([^\"]*)\"\b) doesn't work well, because it matches neither the very first " nor the group just mentioned.

I need it for Java regexp.
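
To make the problem concrete, here is a minimal sketch of what the pattern above does when used for a replacement. The class name, method name and the SAMPLE constant are only illustrative; the \q{...} replacement format is the one shown in the update below.

```java
public class NaiveQuoteDemo {
    // The test string from the question, written as a Java string literal.
    static final String SAMPLE =
        "This is my \"te\n\nst\" case\n"
        + "with lines for \"tes\"t\"ing\" with regex\n"
        + "But as he said \"It could be an arbitrary number of words\"";

    // Wraps every "..." run in \q{...} using the plain pattern "([^"]*)".
    static String replaceNaive(String input) {
        return input.replaceAll("\"([^\"]*)\"", "\\\\q{$1}");
    }

    public static void main(String[] args) {
        // The third word comes out as \q{tes}t\q{ing}, i.e. split in two.
        System.out.println(replaceNaive(SAMPLE));
    }
}
```

The negated class [^"] also matches line breaks, which is why the split "te ... st" is handled, but it always stops at the next quote, so tes"t"ing is cut apart.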

UPDATE: As a result I need to have

This is my \q{te

st} case
with lines for \q{tes"t"ing} with regex
But as he said \q{It could be an arbitrary number of words}

Upvotes: 3

Views: 114

Answers (3)

Cary Swoveland

Reputation: 110685

You could use the regular expression

(?<=\")(?:[a-z]+\"[a-z]+\"[a-z]+|[a-z][^"]+)(?=\")

with the case-insensitive flag i (or prefix the pattern with (?i)).

Demo

As seen at the link this regex matches the following three substrings of the text given in the question:

te                                                                    st
tes"t"ing
It could be an arbitrary number of words

The regex engine performs the following operations:

(?<=\")    # match a double-quote in a positive lookbehind
(?:        # begin a non-capture group
  [a-z]+\" # match 1+ letters, then a double-quote
  [a-z]+\" # match 1+ letters, then a double-quote
  [a-z]+   # match 1+ letters
  |        # or
  [a-z]    # match 1 letter
  [^"]+    # match 1+ characters other than a double-quote
)          # end non-capture group
(?=\")     # match a double-quote in a positive lookahead

Upvotes: 2

Wiktor Stribiżew

Reputation: 626845

You may use

.replaceAll("\\B\"\\b(.*?)\\b\"\\B", "\\\\q{$1}")

Or, if the matches may span across multiple lines, add (?s) modifier:

.replaceAll("(?s)\\B\"\\b(.*?)\\b\"\\B", "\\\\q{$1}")

See the regex demo.

Details

  • \B"\b - a " that is either at the start of the string or preceded with a non-word char, and that is followed with a word char
  • (.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
  • \b"\B - a " that is either at the end of the string or followed with a non-word char, and that is preceded with a word char.

The replacement consists of a literal backslash, q{, the Group 1 value ($1) and a }. The backslash is written as "\\\\" in the Java string literal because a backslash is a special char in the replacement pattern, so it must itself be escaped to insert a real, literal backslash.

See the Java demo:

String s = "This is my \"te\n\nst\" case\nwith lines for \"tes\"t\"ing\" with regex\nBut as he said \"It could be an arbitrary number of words\"";
System.out.println(s.replaceAll("\\B\"\\b(.*?)\\b\"\\B", "\\\\q{$1}"));

Output:

This is my "te

st" case
with lines for \q{tes"t"ing} with regex
But as he said \q{It could be an arbitrary number of words}
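
In the output above the first quoted span is left untouched, because without (?s) the dot cannot cross the line breaks inside it. As a minimal sketch (class and method names are illustrative only), the (?s) variant applied to the same string would look like this:

```java
public class DotallReplaceDemo {
    // The test string from the question, written as a Java string literal.
    static final String SAMPLE =
        "This is my \"te\n\nst\" case\n"
        + "with lines for \"tes\"t\"ing\" with regex\n"
        + "But as he said \"It could be an arbitrary number of words\"";

    // (?s) lets . match line breaks, so a quoted span may run over several lines.
    static String replaceQuoted(String input) {
        return input.replaceAll("(?s)\\B\"\\b(.*?)\\b\"\\B", "\\\\q{$1}");
    }

    public static void main(String[] args) {
        System.out.println(replaceQuoted(SAMPLE));
    }
}
```

With (?s) all three spans are wrapped, including the one split across lines, which is the result requested in the question's update.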

NOTE:

If you also need to match two consecutive double quotes that are neither preceded nor followed by word characters, you can modify the above regular expression as follows:

 .replaceAll("(?s)\\B(\"\\b(.*?)\\b\"|\"\")\\B", "\\\\q{$2}")

See the regex demo.

Details

  • (?s) - an embedded flag option (equal to Pattern.DOTALL) that makes . match line break chars, too
  • \B - a non-word boundary, here, it means that immediately to the left, there must be a non-word char or start of string (because after \B, there is a non-word char, ")
  • ( - start of the first capturing group:
    • "\b(.*?)\b" - a " followed by a word char, then Group 2 capturing any zero or more chars, as few as possible, and then a " that is preceded by a word char (that is why this alternative can't match "": after the first " and before the second, there must be a letter, digit or _)
    • | - or
    • "" - a "" substring
  • ) - end of the first capturing group
  • \B - a non-word boundary, here, it means that immediately to the right, there must be a non-word char or end of string (because before \B, there is a non-word char, ").
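
A small sketch of how this variant treats an empty pair of quotes, under the assumption that an empty \q{} is the desired result for "" (the input string and the class and method names are made up for illustration). When the "" alternative matches, Group 2 does not participate, and Java inserts nothing for $2.

```java
public class EmptyQuotesDemo {
    // Wraps quoted words in \q{...}; an empty pair "" becomes \q{}.
    static String replaceQuoted(String input) {
        return input.replaceAll("(?s)\\B(\"\\b(.*?)\\b\"|\"\")\\B", "\\\\q{$2}");
    }

    public static void main(String[] args) {
        // Hypothetical input containing both an empty pair and a quoted word.
        System.out.println(replaceQuoted("He said \"\" and \"word\""));
    }
}
```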

Upvotes: 2

anubhava

Reputation: 785156

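You may use this regex, which uses a lookbehind and a lookahead to ensure that the characters immediately before and after the quoted span are not non-whitespace characters (that is, each side is either whitespace or the start/end of the string):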

(?<!\S)".*?"(?!\S)

RegEx Demo

Adding a helpful comment from the OP: the following solved the actual problem, which was a bit more than what was mentioned in the question:

str = str.replaceAll("(?s)(?<!\\S)\"(.*?)\"(?!\\S)", "\\\\q{$1}"); 
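
As a minimal, self-contained sketch of that call (the class name, method name and the second input are illustrative assumptions), the following applies it to the string from the question and to a case where a quote touches punctuation:

```java
public class WhitespaceBoundedDemo {
    // The test string from the question, written as a Java string literal.
    static final String SAMPLE =
        "This is my \"te\n\nst\" case\n"
        + "with lines for \"tes\"t\"ing\" with regex\n"
        + "But as he said \"It could be an arbitrary number of words\"";

    // The opening quote must follow whitespace or the start of the string,
    // the closing quote must precede whitespace or the end of the string.
    static String replaceQuoted(String input) {
        return input.replaceAll("(?s)(?<!\\S)\"(.*?)\"(?!\\S)", "\\\\q{$1}");
    }

    public static void main(String[] args) {
        System.out.println(replaceQuoted(SAMPLE));
        // Quotes adjacent to punctuation such as ( or ) are left unchanged.
        System.out.println(replaceQuoted("see (\"note\") here"));
    }
}
```

One consequence of testing for whitespace rather than word characters is that a quoted span directly next to punctuation is not replaced, so this fits inputs where quotes are always set off by spaces or line breaks.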

Upvotes: 2
