Thufir

Reputation: 8487

Regex for the last word in a sentence ending with punctuation (period)

I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."

The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?

thufir@dur:~/NetBeansProjects/regex$ 
thufir@dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar 


trying
a b cd efg hi
matches:
hi


trying
a b cd efg hi.
matches:
thufir@dur:~/NetBeansProjects/regex$ 

code:

package regex;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static void main(String[] args) {
        String matchesLastWordFine = "a b cd efg hi";
        lastWord(matchesLastWordFine);
        String noMatchFound = matchesLastWordFine + ".";
        lastWord(noMatchFound);
    }

    private static void lastWord(String sentence) {
        System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
        Pattern pattern = Pattern.compile("(\\w+)$");
        Matcher matcher = pattern.matcher(sentence);
        String match = null;
        while (matcher.find()) {
            match = matcher.group();
            System.out.println(match);
        }
    }
}

My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)

What regex should I put in the pattern?

Upvotes: 1

Views: 8865

Answers (6)

Taemyr

Reputation: 3437

If you need to have the whole match be the last word you can use lookahead.

\w+(?=(\.))

This matches a set of word characters that are followed by a period, without matching the period.

If you want the last word in the line, regardless of whether the line ends at the end of a sentence or not, you can use:

\w+(?=(\.?$))

Or, if you also want to allow other punctuation such as , ! ; : then use:

\w+(?=(\p{Punct}?$))
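
As a rough sketch of how the last pattern might be used from Java: the class name LookaheadLastWord and the helper lastWord are made up for illustration, and the inner capturing group from the pattern above is dropped here because the whole match is already the word.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class LookaheadLastWord {

    // Word characters that are followed by optional punctuation and then
    // the end of the input. The lookahead checks the punctuation without
    // making it part of the match.
    private static final Pattern LAST_WORD =
            Pattern.compile("\\w+(?=\\p{Punct}?$)");

    // Returns the last word, or null when the pattern does not match.
    static String lastWord(String sentence) {
        Matcher matcher = LAST_WORD.matcher(sentence);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi"));   // hi
        System.out.println(lastWord("a b cd efg hi."));  // hi
        System.out.println(lastWord("a b cd efg hi!"));  // hi
    }
}
```

Because the punctuation sits inside the lookahead, matcher.group() is the bare word and no capture group is needed.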

Upvotes: 2

Thufir

Reputation: 8487

I don't understand why really, but this works:

package regex;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static void main(String[] args) {
        String matchesLastWordFine = "a b cd efg hi";
        lastWord(matchesLastWordFine);
        String noMatchFound = matchesLastWordFine + ".";
        lastWord(noMatchFound);
    }

    private static void lastWord(String sentence) {
        System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
        Pattern pattern = Pattern.compile("(\\w+)");  //(\w+)\.
        Matcher matcher = pattern.matcher(sentence);
        String match = null;
        while (matcher.find()) {
            match = matcher.group();
        }
        System.out.println(match);
    }
}

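I guess the regex \w+ matches every word (doh), and since the loop keeps overwriting match, the last word is what is left at the end. Too simple, really. I was trying to exclude punctuation, but \w only matches letters, digits and the underscore, so the period is never part of a match in the first place.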

Upvotes: 0

smerlung

Reputation: 1519

The $ anchor only lets the pattern match at the end of a line. So if you have several sentences on one line, only the last one can match; the earlier ones will not.

So you should just use:

(\w+)\.

The capture group (group 1) will then hold the word without the period.

You can see an example here
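
A minimal sketch of that idea in Java, collecting group 1 for every sentence on a line. The class name SentenceEndWords and the method wordsBeforePeriods are hypothetical, not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class SentenceEndWords {

    // A word immediately followed by a period; the word is captured in group 1.
    private static final Pattern WORD_BEFORE_PERIOD =
            Pattern.compile("(\\w+)\\.");

    // Collects the word in front of every period in the text.
    static List<String> wordsBeforePeriods(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD_BEFORE_PERIOD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group(1)); // group(1) excludes the period
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsBeforePeriods("a b cd. efg hi.")); // [cd, hi]
    }
}
```

Note that without the $ anchor the pattern also fires on periods that do not end a sentence, such as the one in "3.14" or in an abbreviation.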

Upvotes: 0

jboi

Reputation: 11892

With the regular expression (\w+)\p{Punct} you get a group count of 1, which means matcher.group(0) returns the word with the punctuation and matcher.group(1) returns the word without it.

To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
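
To make the difference between the two groups concrete, here is a small sketch. The class name PunctGroups and the helper firstMatchGroups are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PunctGroups {

    // A word (group 1) followed by one punctuation character.
    private static final Pattern WORD_THEN_PUNCT =
            Pattern.compile("(\\w+)\\p{Punct}");

    // Returns { group(0), group(1) } for the first match, or null if none.
    static String[] firstMatchGroups(String text) {
        Matcher matcher = WORD_THEN_PUNCT.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(0), matcher.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups("a b cd efg hi!");
        System.out.println(groups[0]); // hi!  (full match, with punctuation)
        System.out.println(groups[1]); // hi   (capture group, word only)
    }
}
```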

To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet

Upvotes: 0

ps-aux

Reputation: 12146

You can use a lookahead assertion. For example, to match the sentence without the period:

[\w\s]+(?=\.)

and, for just the last word (the word before the "."):

[\w]+(?=\.)
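
A quick way to compare the two patterns in Java could look like this. The class name LookaheadBeforePeriod and the helper firstMatch are invented for the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class LookaheadBeforePeriod {

    // Returns the first match of the regex in the text, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String sentence = "a b cd efg hi.";
        // Words and whitespace up to the period: the whole sentence.
        System.out.println(firstMatch("[\\w\\s]+(?=\\.)", sentence)); // a b cd efg hi
        // Word characters only: just the word in front of the period.
        System.out.println(firstMatch("[\\w]+(?=\\.)", sentence));    // hi
    }
}
```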

Upvotes: 3

davide

Reputation: 1948

You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case). To say a little more, matcher.group(0) returns the full match. So your regex is almost correct. One improvement concerns your use of $, which anchors the match to the end of the line. Use it only if your sentence fills the line exactly!
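
As a sketch, here is the regex from the question, (\w+)\.$, read through group(1). The class name GroupOneAtLineEnd and the helper lastWord are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class GroupOneAtLineEnd {

    // A word (group 1) followed by a period at the very end of the input.
    private static final Pattern WORD_PERIOD_END =
            Pattern.compile("(\\w+)\\.$");

    // Returns the captured word, or null when the input does not end in "word.".
    static String lastWord(String sentence) {
        Matcher matcher = WORD_PERIOD_END.matcher(sentence);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi."));  // hi
        // The period is not at the end of the input, so $ prevents a match.
        System.out.println(lastWord("a b cd. efg hi"));  // null
    }
}
```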

Upvotes: 1
