Reputation: 55
I'm trying to find a very specific series of characters in Java, but my regular expression is not working properly.
I want to find a word (any word), then a space, then a forward slash, then another space, then an 'M' (lower or upper case), then a series of digits. I'm using the following line:
Elements rating = doc.getElementsMatchingText(Pattern.compile("\\b\\s/\\s[mM][0-9]+"));
But this is finding whole lines (words before and after the intended pattern). This also doesn't help:
Elements rating = doc.getElementsMatchingText(Pattern.compile("^\\b\\s/\\s[mM][0-9]+"));
What am I doing wrong?
Upvotes: 0
Views: 733
Reputation:
Your regex is flawed. I would propose
\w+ / [Mm]\d+
(remember to double the backslashes when you put it in a Java string)
A few things about your regex:
1) You currently don't have anything to match the "word (any word)". I chose \w+, which matches a word of at least one word character. You can use something like \w{2,10} to match words between 2 and 10 characters, for example, if you want to customize further.
2) You don't need the \b at all, since \w+ only matches valid word characters.
3) Keep in mind that \s matches more than just a space. I use a literal space, but you can put in \s if you are OK with it also matching tabs, newlines, and other whitespace.
4) I think \d is more idiomatic and readable than [0-9]
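To illustrate the escaping note above, here is a minimal sketch of the proposed pattern written as a Java string literal. The class name, method name, and sample text are made up for the example, and it runs on a plain String rather than on a parsed document.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and sample input are illustrative only.
class RatingPatternDemo {
    // The regex \w+ / [Mm]\d+ as a Java string literal: every backslash is doubled.
    static final Pattern RATING = Pattern.compile("\\w+ / [Mm]\\d+");

    // Returns the first substring that matches the pattern, or null if none is found.
    static String firstMatch(String text) {
        Matcher m = RATING.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Prints only the matched part, not the whole line.
        System.out.println(firstMatch("Overall rating / M15 for this title"));
    }
}
```

Note that find() locates the pattern anywhere in the input, while group() returns just the matched substring.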
Upvotes: 0
Reputation: 242786
The correct pattern for your case is \\b\\w+\\s/\\s[mM][0-9]+.
However, the problem you describe is related to the API you use rather than to the pattern.
Note that getElementsMatchingText doesn't allow you to access match details, therefore you cannot extract the part of the text that matched the pattern.
You need to iterate over all elements of the doc manually and apply Matcher.find() to the text of each element, or simply apply Matcher.find() with the same pattern again to the text of the elements returned by getElementsMatchingText. Then you would be able to extract the matched part as Matcher.group().
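The extraction step described above could look roughly like the following sketch. It works on a plain String so that it stays self-contained; the assumption is that in the real code each string would be the text of one element returned by getElementsMatchingText. The class name, method name, and sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; in real code the input would be the text of each matching element.
class MatchExtractor {
    // Same pattern as above, compiled once and reused.
    static final Pattern RATING = Pattern.compile("\\b\\w+\\s/\\s[mM][0-9]+");

    // Collects every substring of the text that matches the pattern.
    static List<String> extractAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = RATING.matcher(text);
        while (m.find()) {
            found.add(m.group()); // group() is only the matched part, not the whole text
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample text standing in for an element's text.
        for (String match : extractAll("Rated action / M15 and drama / m7 today")) {
            System.out.println(match);
        }
    }
}
```

Calling find() in a loop returns each successive match, so a single element's text can yield several results.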
Upvotes: 2
Reputation: 39
As for the regex, try:
.* \ [Mm][1-9]*
I used http://rubular.com/ to test my regex; you can experiment with yours there.
Bye
Upvotes: -1