Protagonist

Reputation: 1669

How to count a word having an apostrophe as two separate words using Java regular expressions

I have a string that contains a word with an apostrophe, for example: He is a very very good boy, isn't he?

public class Solution {

    public static void main(String[] args) {

        String s = "He is a very very good boy, isn't he?";
        String[] words = s.split("\\s+");
        int itemCount = words.length;
        System.out.println(itemCount);

        for (int i = 0; i < itemCount; i++) {
            String word = words[i];
            System.out.println(word);
        }
    }
}

The output I get is 9 words, but I want the count to be 10, with isn't counted as 2 words. How can I do this by changing the regular expression above?

Upvotes: 2

Views: 1026

Answers (4)

Akash Thakare

Reputation: 23002

I think you want isn't to be treated as is not, so that it counts as two separate words rather than one.

You can use alternation (|) in the split regular expression:

\\s+|'t

This splits only on 't, so a phrase like my friend's birthday is not affected; there the apostrophe should not start another word.

But that is not the end of the story. There are many other contractions that such an expression should take into account.

For example:

  • 't : isn't, aren't, wasn't, weren't, wouldn't, didn't etc.
  • 's : it's, that's, etc. (this is the difficult one)
  • 'd : I'd, you'd etc.
  • 'll : I'll, they'll etc. ...

So ultimately the following regular expression will solve about 90% of the word-counting problem:

\\s+|'t|'d|'ll

The problem with 's (apostrophe S) is that it also follows nouns, as in Dog's or Cat's, where it shows possession and should not be counted as a separate word. On the other hand, 's is sometimes a contraction of is, as in That's or It's. You can extend the regular expression above to tell contractions apart from possessive apostrophes.

Note: this is only for counting words. In the example sentence it splits isn't into isn and an empty string; the 't itself is discarded.
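
To make the note above concrete, here is a minimal sketch of what the split actually returns. The class name ContractionSplitDemo and the helper splitWords are made up for illustration; only String.split and the regular expression come from the answer. It also shows one caveat: String.split drops trailing empty strings, so a contraction at the very end of the input does not add to the count.

```java
import java.util.Arrays;

class ContractionSplitDemo {

    // Hypothetical helper: split on whitespace or on one of the
    // contraction suffixes 't, 'd, 'll listed in the answer.
    static String[] splitWords(String s) {
        return s.split("\\s+|'t|'d|'ll");
    }

    public static void main(String[] args) {
        // 't is followed by a space, so an empty string sits between
        // the two delimiters and is counted as the "second word".
        String[] middle = splitWords("He is a very very good boy, isn't he?");
        System.out.println(middle.length);            // 10
        System.out.println(Arrays.toString(middle));

        // 't is the last thing in the input: the empty string is now
        // trailing, split() removes it, and the count is one short.
        String[] atEnd = splitWords("He is good, isn't");
        System.out.println(atEnd.length);             // 4
        System.out.println(Arrays.toString(atEnd));
    }
}
```

If contractions can appear at the end of the input, counting with s.split(regex, -1) keeps the trailing empty string, at the cost of also keeping any other trailing empty tokens.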

Upvotes: 0

Bohemian

Reputation: 425198

Split on non-word chars:

String[] words = s.split("\\W+");
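
As a rough check of this approach, the sketch below applies it to the sentence from the question and to an input that starts with a non-word character. The class name NonWordSplitDemo, the helper countTokens and the second sample string are illustrative assumptions, not part of the answer.

```java
class NonWordSplitDemo {

    // Hypothetical helper: number of tokens produced by splitting
    // on runs of non-word characters.
    static int countTokens(String s) {
        return s.split("\\W+").length;
    }

    public static void main(String[] args) {
        // The apostrophe is a non-word character, so isn't becomes isn and t.
        // The final ? only produces a trailing empty string, which split() drops.
        System.out.println(countTokens("He is a very very good boy, isn't he?")); // 10

        // A delimiter at the very start yields a leading empty string,
        // which split() keeps, so the count is 5 rather than 4.
        System.out.println(countTokens("'Tis a good boy"));                      // 5
    }
}
```

Note that this treats every apostrophe the same way, so a possessive such as friend's is also counted as two tokens.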

Upvotes: 0

Ashish Patil

Reputation: 4624

You can try using \p{Punct}, which matches punctuation characters such as ? and !, so they act as delimiters together with whitespace:

        String s = "He is a very very good boy, isn't he?";
        String[] words = s.split("[\\p{Punct}\\s]+");
        int itemCount = words.length;
        System.out.println(itemCount);
        for (int i = 0; i < itemCount; i++) {
            String word = words[i];
            System.out.println(word);
        }
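
For comparison with splitting on \W+, here is a small sketch of where the two delimiter sets differ. In Java's regex syntax the underscore belongs to \p{Punct} but is also a word character for \w, so it is a delimiter for one pattern and not for the other. The class name DelimiterComparison, its helpers and the snake_case sample are assumptions made for this illustration.

```java
class DelimiterComparison {

    // Hypothetical helper: split on punctuation and whitespace.
    static String[] splitOnPunct(String s) {
        return s.split("[\\p{Punct}\\s]+");
    }

    // Hypothetical helper: split on non-word characters.
    static String[] splitOnNonWord(String s) {
        return s.split("\\W+");
    }

    public static void main(String[] args) {
        String question = "He is a very very good boy, isn't he?";
        // Both patterns give 10 tokens for the sentence from the question.
        System.out.println(splitOnPunct(question).length);   // 10
        System.out.println(splitOnNonWord(question).length); // 10

        String mixed = "snake_case word";
        // _ is punctuation, so \p{Punct} splits the identifier in two.
        System.out.println(splitOnPunct(mixed).length);      // 3
        // _ is a word character, so \W+ leaves the identifier whole.
        System.out.println(splitOnNonWord(mixed).length);    // 2
    }
}
```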

Upvotes: 0

Andrew Lygin

Reputation: 6197

It would be more reliable to use the \w construct:

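import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Match runs of word characters instead of splitting on delimiters.
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher("He is a very very good boy, isn't he?");
while (m.find()) {
    System.out.println(m.group());
}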

Otherwise, you need to handle too many situations manually, for instance: "He's a very good boy.Isn't he?".
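
Since the question asks for a count rather than a printout, the same idea can be wrapped in a small counting method. This is only a sketch: the class name WordRunCounter and the method countWords are invented here, while Pattern, Matcher and find() are the standard java.util.regex API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordRunCounter {

    // Compiled once and reused; matches one run of word characters.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Hypothetical helper: counts how many runs of word characters occur in s.
    static int countWords(String s) {
        Matcher m = WORD.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("He is a very very good boy, isn't he?")); // 10
        // The missing space after the full stop does not matter here:
        // He, s, a, very, good, boy, Isn, t, he
        System.out.println(countWords("He's a very good boy.Isn't he?"));        // 9
    }
}
```

Unlike the split-based variants, matching never produces empty tokens, so leading or trailing punctuation does not change the count.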

Upvotes: 1
