Hari Chaudhary

Reputation: 650

Splitting words from text using regex

I need to extract all words from a given text, keeping apostrophes that occur inside a word (so can't counts as a single word).

String Para = "'hello' world '";

I am splitting the text using

String[] splits = Para.split("[^a-zA-Z']");

Expected output:

hello world

But it is giving:

'hello' world '

Everything else works, but the lone apostrophe (') and the quotes around 'hello' are not stripped by the above regex.

How can I filter these two things?
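
For reference, the behaviour described above can be reproduced with a minimal sketch like the one below. The class and method names (OriginalSplitDemo, splitOriginal) are made up for illustration; only the regex and the input come from the question.

```java
import java.util.Arrays;

class OriginalSplitDemo {
    // Splits on every single character that is neither an ASCII letter
    // nor an apostrophe (the regex from the question).
    static String[] splitOriginal(String text) {
        return text.split("[^a-zA-Z']");
    }

    public static void main(String[] args) {
        // Apostrophes are excluded from the delimiter class, so they are never
        // split on: they stay attached to 'hello', and the lone apostrophe
        // survives as a token of its own.
        System.out.println(Arrays.toString(splitOriginal("'hello' world '")));
        // prints ['hello', world, ']
    }
}
```

Since ' is listed inside the negated class, it is treated like a letter, which is why the quotes are never removed.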

Upvotes: 0

Views: 2592

Answers (3)

stema

Reputation: 92976

A Unicode version, without lookarounds:

String TestInput = "This voilà München is the test' 'sentence' that I'm willing to split";

String[] splits = TestInput.split("'?[^\\p{L}']+'?");

for (String t : splits) {
    System.out.println(t);
}

\p{L} matches any character with the Unicode property "Letter".

This splits on any run of characters that are neither letters nor ', and includes a ' directly before or after that run in the delimiter.

Output:

This
voilà
München
is
the
test
sentence
that
I'm
willing
to
split

To also handle a ' at the very start or very end of the input, add them as alternatives:

TestInput.split("'?[^\\p{L}']+'?|^'|'$")
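
One thing to watch with the extended pattern: String.split keeps a leading empty string when the input starts with a delimiter (only trailing empty strings are dropped). A rough sketch of how that could be handled, assuming empty tokens should simply be discarded; the class UnicodeSplitDemo and the helper words are hypothetical names, not part of the answer:

```java
import java.util.ArrayList;
import java.util.List;

class UnicodeSplitDemo {
    // Pattern from the answer: a run of non-letter, non-apostrophe characters
    // with an optional apostrophe on either side, or an apostrophe at the
    // very start or very end of the input.
    static final String DELIMITER = "'?[^\\p{L}']+'?|^'|'$";

    // Hypothetical helper: splits on DELIMITER and drops empty tokens,
    // e.g. the leading "" produced when the input starts with an apostrophe.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.split(DELIMITER)) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("'hello' world '")); // prints [hello, world]
        // Unicode escapes keep the source ASCII-only: "voilà München I'm"
        System.out.println(words("voil\u00e0 M\u00fcnchen I'm"));
    }
}
```

Without the isEmpty() check, the first call would return an array whose first element is the empty string.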

Upvotes: 1

Bernhard Barker

Reputation: 55589

As far as I can tell, you're looking for a ' where either the next or previous character is not a letter.

The regex I came up with to do this, contained in some test code:

// requires: import java.util.Arrays;
String str = "bob can't do 'well'";
String[] splits = str.split("(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+");
System.out.println(Arrays.toString(splits)); // prints [bob, can't, do, well]

Explanation:

(?<=^|[^a-zA-Z])' - matches a ' where the previous character is not a letter, or we're at the start of the string.
'(?=[^a-zA-Z]|$) - matches a ' where the next character is not a letter, or we're at the end of the string.
[^a-zA-Z'] - not a letter or '.
(?:...)+ - one or more of any of the above (the ?: is just to make it a non-capturing group).

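The (?<=...) and (?=...) constructs are regex lookarounds (lookbehind and lookahead): they check the neighbouring characters without making them part of the match.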

Simplification:

The regex can be simplified to the below by using negative lookaround:

"(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+"

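As a quick sanity check, the two variants can be run side by side. This is only a sketch: the class name LookaroundSplitDemo and the constant names are invented here, and the second input is the one from the question.

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    // Version with positive lookbehind/lookahead.
    static final String POSITIVE = "(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+";
    // Simplified version with negative lookbehind/lookahead.
    static final String NEGATIVE = "(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+";

    public static void main(String[] args) {
        String str = "bob can't do 'well'";
        System.out.println(Arrays.toString(str.split(POSITIVE))); // prints [bob, can't, do, well]
        System.out.println(Arrays.toString(str.split(NEGATIVE))); // prints [bob, can't, do, well]

        // When the input starts with a delimiter, String.split keeps a
        // leading empty string, so the first element here is "".
        System.out.println(Arrays.toString("'hello' world '".split(NEGATIVE)));
        // prints [, hello, world]
    }
}
```

If the leading empty string is unwanted, it has to be skipped by the caller; the regex alone does not remove it.
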
Upvotes: 1

nhahtdh

Reputation: 56809

If you define a word as a sequence that:

  • Must start and end with an English letter (a-zA-Z)
  • May contain apostrophes (') in between

Then you can use the following regex in a Matcher.find() loop to extract matches:

[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?

Sample code:

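// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String inputString = "'hello' world '"; // sample input taken from the question
Pattern p = Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");
Matcher m = p.matcher(inputString);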

while (m.find()) {
    System.out.println(m.group());
}

Demo [1]

[1] The demo uses the PCRE regex flavor, but the result should be the same in Java for this regex.
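
For anyone who wants the words in a collection rather than printed, the same loop could be wrapped roughly as follows. The class name WordMatcherDemo and the method extractWords are hypothetical; the regex is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatcherDemo {
    // A word starts and ends with an ASCII letter and may contain
    // apostrophes in between.
    private static final Pattern WORD =
            Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");

    // Collects every match of WORD in the input, in order of appearance.
    static List<String> extractWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("'hello' world '"));     // prints [hello, world]
        System.out.println(extractWords("bob can't do 'well'")); // prints [bob, can't, do, well]
    }
}
```

Because this matches the words instead of splitting on the delimiters, no empty strings show up even when the input begins or ends with an apostrophe.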

Upvotes: 0
