Reputation: 161
This is an example of the string I want to extract data from:
<span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#B82933;font-size:120%' href='/cartelera/pelicula/18312'>Español </a></span><br><span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#FBEBC4;font-size:120%' href='/cartelera/pelicula/18313'>Subtitulada </a></span><br> </div>
And this is the regular expression I'm using for it:
"pelicula/([0-9]*)'>([\\w\\s]*)</a>"
I tested this regular expression in RegexPlanet, and it gave me the expected result:
group(1) = 18313
group(2) = Subtitulada
But when I try to implement that regular expression in Java, it won't match anything. Here's the code:
Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
Matcher matcher = pattern.matcher(inputLine);
while (matcher.find()) {
    version = matcher.group(2);
}
What's the problem? The regular expression has already been tested, and the same code searches for several other patterns without trouble; only two of them fail (I'm showing just one here). Thank you in advance!
EDIT
I discovered the problem. When I view the page's source code in the browser it shows everything, but when I fetch the page from Java I get different source code. Why? Because the page asks for your city so it can show information for that location. I don't know whether there's a workaround that would let me access the information I want, but that's the cause.
Upvotes: 4
Views: 168
Reputation: 1452
Your regex is correct, but \w does not match ñ. By default, Java's \w only matches [a-zA-Z_0-9].
I changed the regex to
"pelicula/([0-9]*)'>(.*?)</a>"
and it seems to match both occurrences.
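To make the difference concrete, here is a minimal sketch that runs the original pattern, the reluctant pattern, and a third variant against a shortened copy of the sample input. The class name VersionExtractor, the extract helper, and the shortened INPUT string are made up for illustration; the third variant, which keeps \w but compiles with Pattern.UNICODE_CHARACTER_CLASS, is an alternative not mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VersionExtractor {
    // Shortened copy of the sample input; \u00f1 is the letter n with tilde,
    // written as an escape so the file compiles under any source encoding.
    static final String INPUT =
        "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol </a></span><br>"
      + "<a href='/cartelera/pelicula/18313'>Subtitulada </a></span><br>";

    // Pattern from the question: \w is ASCII-only by default.
    static final Pattern ORIGINAL =
        Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");

    // Pattern from this answer: any characters, as few as possible.
    static final Pattern RELUCTANT =
        Pattern.compile("pelicula/([0-9]*)'>(.*?)</a>");

    // Alternative (assumption, not from the thread): make \w Unicode-aware.
    static final Pattern UNICODE =
        Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>",
                        Pattern.UNICODE_CHARACTER_CLASS);

    // Collects "id=label" for every match of the pattern in the input.
    static List<String> extract(Pattern pattern, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            results.add(matcher.group(1) + "=" + matcher.group(2).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(extract(ORIGINAL, INPUT));  // only the 18313 link
        System.out.println(extract(RELUCTANT, INPUT)); // both links
        System.out.println(extract(UNICODE, INPUT));   // both links
    }
}
```

The original pattern skips the first link because the match stops at the ñ, which is consistent with RegexPlanet reporting only 18313 and Subtitulada.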
Here I've used the reluctant quantifier *? to prevent .* from matching everything between the first <a> and the last </a>.
See What is the difference between `Greedy` and `Reluctant` regular expression quantifiers? for an explanation.
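A small sketch of the greedy versus reluctant behaviour, using a simplified two-link input. The class name GreedyVsReluctant, the firstGroup helper, and the sample string are hypothetical and only serve the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns group(1) of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<a>one</a><a>two</a>";

        // Greedy: .* runs to the end and backtracks to the LAST </a>.
        System.out.println(firstGroup("<a>(.*)</a>", input));  // one</a><a>two

        // Reluctant: .*? stops at the FIRST </a> it can reach.
        System.out.println(firstGroup("<a>(.*?)</a>", input)); // one
    }
}
```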
@Bohemian is correct in pointing out that you might need to enable the Pattern.DOTALL flag as well if the text in <a> has line breaks.
Upvotes: 2
Reputation: 424983
If your input spans several lines (i.e. it contains newline characters), you'll need to turn on "dot matches newline".
There are two ways to do this:
Use the "dot matches newline" regex switch (?s) in your regex:
Pattern pattern = Pattern.compile("(?s)pelicula/([0-9]*)'>([\\w\\s]*)</a>");
or use the Pattern.DOTALL flag in the call to Pattern.compile():
Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>", Pattern.DOTALL);
Note that this switch only changes what . matches. The pattern above uses [\\w\\s] rather than ., and \s already matches newlines, so the switch matters only once the pattern contains a dot, as in (.*?).
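As a rough check of when the switch matters, the sketch below tries a dot-based pattern against a label that contains a line break. The class name DotAllDemo, the found helper, and the multi-line input are assumptions made for illustration; the dot-based pattern is the (.*?) variant rather than the one shown above.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Hypothetical input: the link text is split across two lines.
    static final String INPUT = "pelicula/18313'>Sub\ntitulada </a>";

    // True when the regex, compiled with the given flags, matches somewhere.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String dotRegex = "pelicula/([0-9]*)'>(.*?)</a>";

        // Without the switch, . cannot cross the newline.
        System.out.println(found(dotRegex, 0, INPUT));                // false

        // Inline switch and compile flag are equivalent.
        System.out.println(found("(?s)" + dotRegex, 0, INPUT));       // true
        System.out.println(found(dotRegex, Pattern.DOTALL, INPUT));   // true

        // A character class with \s matches the newline without any switch.
        System.out.println(
            found("pelicula/([0-9]*)'>([\\w\\s]*)</a>", 0, INPUT));   // true
    }
}
```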
Upvotes: 1