martin.p

Reputation: 363

Grouping sentences separated by specific word

I'm trying to group two sub-sentences of any reasonable length that are separated by a specific word ("AND" in the example), where the second sub-sentence is optional. Some examples:

CASE1:

foo sentence A AND foo sentence B

shall give

"foo sentence A" --> matching group 1

"AND" --> matching  group 2 (optionally)

"foo sentence B" --> matching  group 3

CASE2:

foo sentence A

shall give

"foo sentence A" --> matching  group 1
"" --> matching  group 2 (optionally)
"" --> matching  group 3

I tried the following regex

(.*) (AND (.*))?$

and it works, but in CASE2 only if I put a space at the end of the string; otherwise the pattern doesn't match. If I move the space before "AND" inside the optional group, then in CASE1 the matcher puts the whole string in the first group. I wondered about lookahead and lookbehind assertions, but I'm not sure they can help me. Any suggestions? Thanks

Upvotes: 1

Views: 728

Answers (5)

Toto

Reputation: 91488

I'd use this regex:

^(.*?)(?: (AND) (.*))?$

explanation:

The regular expression:

(?-imsx:^(.*?)(?: (AND) (.*))?$)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      AND                      'AND'
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
    (                        group and capture to \3:
----------------------------------------------------------------------
      .*                       any character except \n (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \3
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
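As a minimal sketch of how this pattern could be used from Java (the class name `SplitOnAnd` and the helper `groups` are hypothetical, not part of the answer), the following maps groups that did not participate in the match to empty strings, which is what CASE2 in the question asks for:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitOnAnd {
    // Pattern from the answer: lazy first group, optional " AND <rest>" tail.
    private static final Pattern P = Pattern.compile("^(.*?)(?: (AND) (.*))?$");

    // Returns {group 1, group 2, group 3}; groups that did not take part
    // in the match come back as "" instead of null.
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null; // e.g. input containing a line break
        }
        String[] out = new String[3];
        for (int i = 0; i < 3; i++) {
            String g = m.group(i + 1);
            out[i] = (g == null) ? "" : g;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: foo sentence A|AND|foo sentence B
        System.out.println(String.join("|", groups("foo sentence A AND foo sentence B")));
        // prints: foo sentence A||
        System.out.println(String.join("|", groups("foo sentence A")));
    }
}
```

Because the first group is lazy, it stops at the first " AND " in the input; anything after that, including further occurrences of "AND", ends up in group 3.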

Upvotes: 2

Ro Yo Mi

Reputation: 15010

Description

This regex returns the requested parts of the string in the requested groups. The "and" is optional; if it is not found in the string, the entire string is placed in group 1. The \s*? and \b pieces are intended to keep surrounding whitespace out of the captured groups (although, as the Case 1 output shows, group 3 can still keep a leading space).

^\s*?\b(.*?)\b\s*?(?:\b(and)\b\s*?\b(.*?)\b\s*?)?$


Groups

0 gets the entire matching string

  1. gets the string before the separating word and; if there is no and, the entire string appears here
  2. gets the separating word, in this case it's and
  3. gets the second part of the string

Java Code Example:

Case 1

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
  public static void main(String[] asd) {
    String sourcestring = "foo sentence A AND foo sentence B";
    Pattern re = Pattern.compile("^\\s*?\\b(.*?)\\b\\s*?(?:\\b(and)\\b\\s*?\\b(.*?)\\b\\s*?)?$", Pattern.CASE_INSENSITIVE);
    Matcher m = re.matcher(sourcestring);
    if (m.find()) {
      // Print group 0 (the whole match) followed by each capture group
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + groupIdx + "] = " + m.group(groupIdx));
      }
    }
  }
}

$matches Array:
(
    [0] => foo sentence A AND foo sentence B
    [1] => foo sentence A
    [2] => AND
    [3] =>  foo sentence B
)

Case 2, using the same regex

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
  public static void main(String[] asd) {
    String sourcestring = "foo sentence A";
    Pattern re = Pattern.compile("^\\s*?\\b(.*?)\\b\\s*?(?:\\b(and)\\b\\s*?\\b(.*?)\\b\\s*?)?$", Pattern.CASE_INSENSITIVE);
    Matcher m = re.matcher(sourcestring);
    if (m.find()) {
      // Print group 0 (the whole match) followed by each capture group
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + groupIdx + "] = " + m.group(groupIdx));
      }
    }
  }
}

$matches Array:
(
    [0] => foo sentence A
    [1] => foo sentence A
)

Upvotes: 2

Bohemian

Reputation: 425208

Change your regex to make the space after the first sentence optional:

^(.*?\\S) ?(AND (.*))?$

Or you could use split() to consume the AND and any surrounding spaces:

String[] sentences = sentence.split("\\s*AND\\s*");
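A short sketch of what the split() approach returns for the two cases in the question (the class name `AndSplitter` and the method `parts` are made up for illustration). Note that, unlike the regex with groups, the separator itself is consumed and does not appear in the result:

```java
import java.util.Arrays;

class AndSplitter {
    // Splits on "AND" together with any whitespace around it,
    // so the returned parts carry no padding next to the separator.
    static String[] parts(String sentence) {
        return sentence.split("\\s*AND\\s*");
    }

    public static void main(String[] args) {
        // prints: [foo sentence A, foo sentence B]
        System.out.println(Arrays.toString(parts("foo sentence A AND foo sentence B")));
        // prints: [foo sentence A]
        System.out.println(Arrays.toString(parts("foo sentence A")));
    }
}
```

Checking the array length (1 or 2) then tells you whether the second sentence was present.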

Upvotes: 0

Kent

Reputation: 195199

Your case 2 is a little strange...

but I would do

String[] parts = sentence.split("(?<=AND)|(?=AND)");

Then check parts.length. If the length is 1, it is case 2: the array holds only the sentence, and you can add empty strings as your "group 2" and "group 3".

In case 1 you get the parts directly:

[foo sentence A , AND,  foo sentence B]
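The length check and padding described above could look roughly like this. It is only a sketch: the class name `LookaroundSplit` and the method `threeParts` are hypothetical, the trimming is an addition, and the code assumes at most one "AND" in the input (anything beyond the third part is dropped).

```java
import java.util.Arrays;

class LookaroundSplit {
    // Always returns three entries: first sentence, separator, second sentence.
    // Missing entries stay as empty strings (case 2 in the question).
    static String[] threeParts(String sentence) {
        // Zero-width split points just before and just after "AND",
        // so the separator is kept as its own array element.
        String[] raw = sentence.split("(?<=AND)|(?=AND)");
        String[] out = {"", "", ""};
        for (int i = 0; i < raw.length && i < out.length; i++) {
            out[i] = raw[i].trim(); // drop the spaces left next to "AND"
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: [foo sentence A, AND, foo sentence B]
        System.out.println(Arrays.toString(threeParts("foo sentence A AND foo sentence B")));
        // prints: [foo sentence A, , ]
        System.out.println(Arrays.toString(threeParts("foo sentence A")));
    }
}
```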

Upvotes: 0

greedybuddha

Reputation: 7507

How about just using

String split[] = sentence.split("AND");

That will split the sentence up by your word and give you a list of subparts.
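One thing to watch with a plain split("AND") is that it also splits inside longer words and leaves the surrounding spaces on the parts. The sketch below compares it with a whole-word variant; the `\\b` word-boundary version is a suggested alternative, not part of the answer, and the class and method names are made up for illustration.

```java
import java.util.Arrays;

class WordBoundarySplit {
    // The split exactly as suggested in the answer.
    static String[] plain(String s) {
        return s.split("AND");
    }

    // Variant that only treats "AND" as a separator when it is a whole word,
    // and also consumes the whitespace around it.
    static String[] wholeWord(String s) {
        return s.split("\\s*\\bAND\\b\\s*");
    }

    public static void main(String[] args) {
        String s = "a BRAND new car AND a bike";
        // prints: [a BR,  new car ,  a bike]
        System.out.println(Arrays.toString(plain(s)));
        // prints: [a BRAND new car, a bike]
        System.out.println(Arrays.toString(wholeWord(s)));
    }
}
```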

Upvotes: 2
