Reputation: 366
I have a problem with my parser. I want to read an image link from a website, and this normally works fine. But today I got a link that contains special characters, and the usual regex did not work.
This is what my code looks like:
Pattern t = Pattern.compile(regex.trim());
Matcher x = t.matcher(content[i].toString());
if(x.find())
{
values[i] = x.group(1);
}
And this is the part of the HTML that causes trouble:
<div class="open-zoomview zoomlink" itemscope="" itemtype="http://schema.org/Product">
<img class="zoomLink productImage" src="
http://tnm.scene7.com/is/image/TNM/template_335x300?$plus_335x300$&$image=is{TNM/1098845000_prod_001}&$ausverkauft=1&$0prozent=1&$versandkostenfrei=0" alt="Produkt Atika HB 60 Benzin-Heckenschere" title="Produkt Atika HB 60 Benzin-Heckenschere" itemprop="image" />
</div>
And this is the regex I am using to get the part in the src-attribute:
<img .*src="(.*?)" .*>
I believe it has something to do with all the special characters inside the link, but I'm not sure how to escape all of them. I already tried
Pattern.quote(content[i].toString())
But the outcome was the same: nothing found.
Upvotes: 0
Views: 193
Reputation: 43673
You should actually use <img\\s.*?\\bsrc=["'](.*?)["'].*?> with the (?s) modifier, so that . also matches line breaks.
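A minimal sketch of this suggestion, with the dots left unescaped so they act as wildcards and (?s) placed at the start of the pattern. The class name ImgSrcExtractor, the method firstImgSrc, and the shortened sample HTML are assumptions made for illustration; they do not come from the question. Because the src value in the question starts with a line break, the captured group is trimmed before it is returned.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcExtractor {
    // (?s) turns on DOTALL, so '.' also matches the line break after src="
    private static final Pattern IMG_SRC =
            Pattern.compile("(?s)<img\\s.*?\\bsrc=[\"'](.*?)[\"'].*?>");

    /** Hypothetical helper: trimmed src of the first img tag, or null if none matches. */
    public static String firstImgSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question (line break after src=")
        String html = "<img class=\"zoomLink productImage\" src=\"\n"
                + "http://tnm.scene7.com/is/image/TNM/template_335x300"
                + "?$plus_335x300$&$image=is{TNM/1098845000_prod_001}\""
                + " alt=\"Produkt\" />";
        System.out.println(firstImgSrc(html));
    }
}
```

The characters $, & and { in the URL need no escaping here, because they are part of the input text and not of the pattern. That is also why Pattern.quote on the input makes no difference.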
Upvotes: 0
Reputation: 1054
This is probably caused by the newline within the tag. The . character won't match it.
Have you considered not using regex to parse HTML? Parsing HTML with regex is notoriously fragile. Please consider using a parsing library such as JSoup for this.
Upvotes: 0
Reputation: 347
Your regex should look like this:
String regex = "<img .*src=\"(.*?)\" .*>";
Upvotes: 0
Reputation: 8932
By default, the . character matches any character except line terminators. Therefore, your pattern won't match if there are newlines in the img tag.
Use Pattern.compile(..., Pattern.DOTALL) or prepend (?s) to your pattern.
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html#DOTALL
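To see the difference, here is a small sketch that runs the unchanged pattern from the question with and without Pattern.DOTALL. The class name DotallDemo, its two helper methods, and the example.com URL are made up for this illustration.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // The pattern from the question, unchanged
    static final String REGEX = "<img .*src=\"(.*?)\" .*>";

    /** Default mode: '.' stops at line terminators. */
    public static boolean findsDefault(String html) {
        return Pattern.compile(REGEX).matcher(html).find();
    }

    /** DOTALL mode: '.' also matches line terminators. */
    public static boolean findsDotall(String html) {
        return Pattern.compile(REGEX, Pattern.DOTALL).matcher(html).find();
    }

    public static void main(String[] args) {
        // An img tag with a line break right after src=", as in the question
        String html = "<img class=\"productImage\" src=\"\n"
                + "http://example.com/a.png\" alt=\"x\" />";
        System.out.println(findsDefault(html)); // false
        System.out.println(findsDotall(html));  // true
    }
}
```

Two caveats: group(1) will still start with the line break, so calling trim() on it is probably what you want. Also, in DOTALL mode the greedy .* parts can run across several tags if the input contains more than one img element, so this fits best when each input string holds a single tag.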
Upvotes: 2