daydreamer

Reputation: 92079

Java Regular Expression not working, same pattern works on online website

Problem

I am trying to extract words from the input

Pacific Gas & Electric (PG&E), San Diego Gas & Electric (SDG&E), Salt River Project (SRP), Southern California Edison (SCE)

I tried that online, and my pattern (\w\s?&?\s?\(?\)?) seems to work.

But when I write my Java program, it does not find a match:

private static void findWords() {
    final Pattern PATTERN = Pattern.compile("(\\w\\s?&?\\s?\\(?\\)?)");
    final String INPUT = "Pacific Gas & Electric (PG&E), San Diego Gas & Electric (SDG&E), Salt River Project (SRP), Southern California Edison (SCE)";

    final Matcher matcher = PATTERN.matcher(INPUT);
    System.out.println(matcher.matches());
}

It prints false.

Question

  1. Why is there a mismatch? It seems my understanding is poor here.
  2. How can I get the words out as groups, meaning Pacific Gas & Electric (PG&E) as match group 1, and so on?

Upvotes: 1

Views: 315

Answers (4)

You might want to re-evaluate the output you're getting from rubular.

From the documentation:

The matches method attempts to match the entire input sequence against the pattern.

What you have there in rubular finds a bunch of matches because just about every character is a match.

Nowhere in your rubular result does it tell you that the entire string is a match, though. I'd re-evaluate the results you're seeing there.

A regular expression to match words is extremely simple.

You can use

\b\S*\b 

http://rubular.com/r/ljYs1xO1Qh

or simply

\S*

http://rubular.com/r/xgEuGse1lc

depending on your needs

Upvotes: 3

atamanroman

Reputation: 11818

Matcher#matches returns true only if the whole string matches the regular expression.

As you can see in your online matcher, your regex does not match the whole string, only a single character at a time (sometimes a bit more). So your regex matches "P" and "a" and "c" and "i" and so on. You should fix your regex first and then use Matcher#find() and Matcher#group() to get the matching groups.
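To see this for yourself, here is a minimal sketch that runs the pattern from the question through Matcher#find() and Matcher#group(). The class name FindDemo and the helper findAll are made up for illustration, and the input is shortened to the first company name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindDemo {
    // Hypothetical helper: collects every substring the regex matches, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {          // advance to the next match, if any
            matches.add(matcher.group()); // the text of the current match
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "Pacific Gas & Electric (PG&E)";
        String regex = "(\\w\\s?&?\\s?\\(?\\)?)"; // the pattern from the question

        // Each element is one small fragment, starting with "P", "a", "c", ...
        System.out.println(findAll(regex, input));

        // The pattern never covers the whole input, so this prints false.
        System.out.println(Pattern.matches(regex, input));
    }
}
```

The list contains many tiny matches rather than one match for the whole name, which is why matches() reports false.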

Upvotes: 2

Sabuj Hassan

Reputation: 39385

If you want to get the matches out of your string, here is what you can try:

final String INPUT = "Pacific Gas & Electric (PG&E), San Diego Gas & Electric (SDG&E), Salt River Project (SRP), Southern California Edison (SCE)";
Pattern pattern = Pattern.compile("(.*?\\([^)]+\\))(?:,\\s*|$)");
Matcher m = pattern.matcher(INPUT);
while (m.find()) {
    System.out.println(m.group(1));
}

Alternatively, you can use INPUT.split("\\s*,\\s*"); if the names don't contain any commas.

Now, to the question "Why is there a mismatch, seems like my understanding is poor here": because the matches() method of the Matcher class performs matching over the whole string.
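A small sketch of the split approach, assuming (as noted) that no name contains a comma. The class name SplitDemo and the helper splitNames are only for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class SplitDemo {
    // Hypothetical helper: splits on commas and drops the whitespace around them.
    static List<String> splitNames(String input) {
        return Arrays.asList(input.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        String input = "Pacific Gas & Electric (PG&E), San Diego Gas & Electric (SDG&E), "
                + "Salt River Project (SRP), Southern California Edison (SCE)";

        // Prints each of the four company names on its own line.
        for (String name : splitNames(input)) {
            System.out.println(name);
        }
    }
}
```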

Upvotes: 0

Rohit Jain

Reputation: 213311

If you use the Matcher#find() method instead of the Matcher#matches() method, you'll get true as the outcome. The reason is that the matches() method assumes implicit anchors, a caret (^) and a dollar ($), at the ends. So it tries to match the entire string against the regex. If the entire string does not match, it returns false.
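The difference can be sketched as follows. The class AnchorDemo and its two helpers are hypothetical names, and the simple pattern \w+ stands in for the one in the question. The explicit anchors used here are \A and \z, which behave roughly like the implicit ones; ^ and $ are close, but $ can also match just before a trailing line terminator.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // True only if the regex matches the input from start to end.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches anywhere inside the input.
    static boolean findsAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "Salt River";
        String regex = "\\w+";

        System.out.println(matchesWhole(regex, input));   // false: the space is not a word character
        System.out.println(findsAnywhere(regex, input));  // true: "Salt" matches

        // Adding explicit anchors makes find() behave like matches().
        String anchored = "\\A(?:" + regex + ")\\z";
        System.out.println(findsAnywhere(anchored, input)); // false
    }
}
```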

Upvotes: 4
