user1259401

Reputation: 55

Can't find only one word using regular expressions

I'm trying to find a very specific sequence of characters in Java, but my regular expression is not working properly.

I want to find a word (any word), then a space, then a forward slash, then another space, then an 'M' (lower or upper case), then a series of digits. I'm using the following line:

Elements rating = doc.getElementsMatchingText(Pattern.compile("\\b\\s/\\s[mM][0-9]+")); 

But this returns whole lines (including the words before and after the intended pattern). This doesn't help either:

Elements rating = doc.getElementsMatchingText(Pattern.compile("^\\b\\s/\\s[mM][0-9]+"));    

What am I doing wrong?

Upvotes: 0

Views: 733

Answers (3)

user1322340

Reputation:

Your regex is flawed. I would propose

\w+ / [Mm]\d+

(remember to escape the backslashes appropriately when you put it in a Java string)

A few things about your regex:

1) Your pattern currently has nothing that matches the "word (any word)" part. I chose \w+, which matches a run of one or more word characters. If you want to customize further, you can use something like \w{2,10} to match words between 2 and 10 characters long.

2) You don't need the \b at all, since \w+ only matches word characters.

3) Keep in mind that \s may match more than just a space. I use a literal space, but you can put in \s if you are OK with it also matching tabs, newlines, etc.

4) I think \d is more idiomatic and readable than [0-9]
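To illustrate the escaping mentioned above, here is a minimal sketch of the proposed pattern written as a Java string literal, with each backslash doubled. The class name, the helper method, and the sample sentence are made up for illustration and are not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedPatternDemo {
    // The regex  \w+ / [Mm]\d+  as a Java string literal: every backslash is doubled.
    static final Pattern RATING = Pattern.compile("\\w+ / [Mm]\\d+");

    // Returns the first fragment of the text that matches the pattern, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = RATING.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input
        System.out.println(firstMatch("Overall rating / M15 for this title")); // prints "rating / M15"
    }
}
```

Note that find() picks up only the word directly before the slash, not the earlier words on the line.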

Upvotes: 0

axtavt

Reputation: 242786

The correct pattern for your case is \\b\\w+\\s/\\s[mM][0-9]+.

However, the problem you describe is related to the API you use rather than to the pattern. Note that getElementsMatchingText doesn't give you access to the match details, so you cannot extract the part of the text that matched the pattern.

You need to iterate over all elements of the doc manually and apply Matcher.find() to the text of each element, or simply apply Matcher.find() with the same pattern again to the text of elements returned by getElementsMatchingText. Then you would be able to extract the matched part as Matcher.group().
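The second step could look roughly like the sketch below. It uses only java.util.regex and takes a plain String, on the assumption that you have already obtained the text of each element (for example via jsoup's Element.text()). The class name, method name, and sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchExtractor {
    // Same pattern as above: word, whitespace, slash, whitespace, m/M, digits.
    static final Pattern RATING = Pattern.compile("\\b\\w+\\s/\\s[mM][0-9]+");

    // Collects every matched fragment found in the text of a single element.
    static List<String> extractAll(String elementText) {
        List<String> found = new ArrayList<>();
        Matcher m = RATING.matcher(elementText);
        while (m.find()) {
            found.add(m.group()); // only the matched part, not the whole text
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical element text; in practice this would come from each matched element.
        System.out.println(extractAll("Title: Foo / M12 and Bar / m7 end")); // prints [Foo / M12, Bar / m7]
    }
}
```

Calling this once per element returned by getElementsMatchingText would yield just the matched fragments instead of the full element text.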

Upvotes: 2

Simone

Reputation: 39

About the regex, try with:

.* / [Mm][0-9]+

I used http://rubular.com/ to test my regex, so you can experiment there yourself.

Bye

Upvotes: -1
