Pundia

Reputation: 161

Can't get a match for regular expression in Java

This is an example of the string I want to extract data from:

<span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#B82933;font-size:120%' href='/cartelera/pelicula/18312'>Español  </a></span><br><span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#FBEBC4;font-size:120%' href='/cartelera/pelicula/18313'>Subtitulada  </a></span><br>          </div>

And this is the regular expression I'm using for it:

"pelicula/([0-9]*)'>([\\w\\s]*)</a>"

I tested this regular expression in RegexPlanet, and it gave me the expected result:

group(1) = 18313
group(2) = Subtitulada

But when I try to implement that regular expression in Java, it won't match anything. Here's the code:

Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
Matcher matcher = pattern.matcher(inputLine);
while (matcher.find()) {
    version = matcher.group(2);
}

What's the problem? The regular expression is already tested, and the same code searches for several other patterns; I'm only having trouble with two of them (I'm showing just one here). Thank you in advance!

EDIT

I discovered the problem. If I view the page's source code in the browser, it shows everything, but when I fetch the page from Java, I get different source code. Why? Because the page asks for your city so it can show information for that location. I don't know if there's a workaround that would let me access the information I want, but that's the cause.

Upvotes: 4

Views: 168

Answers (2)

mzzzzb

Reputation: 1452

Your regex is correct, but \w does not match ñ: by default, Java's \w only covers [a-zA-Z_0-9].
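A quick way to check this is the small sketch below. The class and method names are made up for illustration, and ñ is written as \u00f1 so the file compiles regardless of source encoding. It also tries Pattern.UNICODE_CHARACTER_CLASS (available since Java 7), which is one possible alternative to rewriting the pattern, assuming your runtime supports it.

```java
import java.util.regex.Pattern;

class WordClassDemo {
    // Default behaviour: \w is ASCII-only, i.e. [a-zA-Z_0-9]
    static boolean asciiMatch(String s) {
        return Pattern.compile("[\\w\\s]*").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \w also covers letters such as ñ
    static boolean unicodeMatch(String s) {
        return Pattern.compile("[\\w\\s]*", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println(asciiMatch("Subtitulada  "));      // plain ASCII label
        System.out.println(asciiMatch("Espa\u00f1ol  "));     // label containing ñ
        System.out.println(unicodeMatch("Espa\u00f1ol  "));   // same label, Unicode-aware \w
    }
}
```

The ASCII-only pattern accepts "Subtitulada" but rejects "Español", which is why only the second link was captured in the question.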

I changed the regex to

"pelicula/([0-9]*)'>(.*?)</a>"

and it matches both occurrences. Here I've used the reluctant *? quantifier to prevent .* from matching everything between the first <a> and the last </a>. See "What is the difference between `Greedy` and `Reluctant` regular expression quantifiers?" for an explanation.
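To see the difference on this input, here is a minimal sketch. The HTML constant is a shortened version of the sample from the question, and the class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Shortened form of the question's HTML: two links on one line
    static final String HTML =
        "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol  </a></span><br>"
      + "<a href='/cartelera/pelicula/18313'>Subtitulada  </a></span><br>";

    // Collects "id:label" for every match of the given regex
    static List<String> labels(String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(HTML);
        while (m.find()) {
            out.add(m.group(1) + ":" + m.group(2).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Reluctant: stops at the nearest </a>, so each link is a separate match
        System.out.println(labels("pelicula/([0-9]*)'>(.*?)</a>"));
        // Greedy: runs to the last </a>, so both links collapse into one match
        System.out.println(labels("pelicula/([0-9]*)'>(.*)</a>"));
    }
}
```

With the reluctant quantifier the loop yields one entry per link; with the greedy one it yields a single entry whose label swallows the markup between the two links.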

@Bohemian is correct in pointing out that you might need to enable the Pattern.DOTALL flag as well if the text inside the <a> element has line breaks.

Upvotes: 2

Bohemian

Reputation: 424983

If your input spans several lines (i.e., it contains newline characters) and the pattern relies on . to cross them, you'll need to turn on "dot matches newline".

There are two ways to do this:

Use the "dot matches newline" regex switch (?s) in your regex:

Pattern pattern = Pattern.compile("(?s)pelicula/([0-9]*)'>([\\w\\s]*)</a>");

or use the Pattern.DOTALL flag in the call to Pattern.compile():

Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>", Pattern.DOTALL);
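One caveat worth checking: DOTALL only changes what . matches, so it matters for a (.*?) style pattern, while [\w\s] already accepts line breaks through \s. The sketch below illustrates this on a made-up input with a newline inside the link text; the class name, helper, and input string are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns group(2) of the first match, or null when nothing matches
    static String label(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: the link text is broken across two lines
        String input = "pelicula/18313'>Subtitulada\n  </a>";

        // . does not match \n by default, so this finds nothing
        System.out.println(label("pelicula/([0-9]*)'>(.*?)</a>", 0, input));
        // With DOTALL, . also matches \n and the label is captured
        System.out.println(label("pelicula/([0-9]*)'>(.*?)</a>", Pattern.DOTALL, input));
        // \s already matches \n, so the character class needs no flag
        System.out.println(label("pelicula/([0-9]*)'>([\\w\\s]*)</a>", 0, input));
    }
}
```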

Upvotes: 1
