Reputation: 8201
I am looking for a regular expression that can extract the src attribute (matched case-insensitively) from the following HTML snippets in Java.
<html><img src="kk.gif" alt="text"/></html>
<html><img src='kk.gif' alt="text"/></html>
<html><img src = "kk.gif" alt="text"/></html>
Upvotes: 14
Views: 37548
Reputation: 8562
This answer is for people arriving from Google, since it comes long after the question was asked.
Copying cletus's pattern as originally posted gave me an error.
Modifying it and passing the string src\\s*=\\s*([\"'])?([^\"']*) to Pattern.compile worked for me.
Here is the full example:
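import java.util.regex.Matcher;
import java.util.regex.Pattern;
String htmlString = "<div class=\"current\"><img src=\"img/HomePageImages/Paris.jpg\"></div>"; // Sample HTML
String ptr = "src\\s*=\\s*([\"'])?([^\"']*)"; // Group 1: optional opening quote, group 2: the value
Pattern p = Pattern.compile(ptr);
Matcher m = p.matcher(htmlString);
if (m.find()) {
    String src = m.group(2); // Result: img/HomePageImages/Paris.jpg
}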
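The example above stops at the first match. As a minimal sketch, assuming you want every src value in a document and want SRC= to match as well, the same pattern could be compiled with Pattern.CASE_INSENSITIVE and applied in a loop. The class and method names (SrcFinder, findAllSrc) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcFinder {
    // Same pattern as in the answer, compiled case-insensitively (an addition).
    private static final Pattern SRC =
            Pattern.compile("src\\s*=\\s*([\"'])?([^\"']*)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collects group 2 of every match, in document order.
    static List<String> findAllSrc(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = SRC.matcher(html);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div><img src=\"a.jpg\"><IMG SRC='b.png'></div>";
        System.out.println(findAllSrc(html));
    }
}
```

Because the pattern is not anchored to an img tag, it picks up a src attribute on any element, and an unquoted value runs on until the next quote character.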
Upvotes: 1
Reputation: 7191
One possibility:
String imgRegex = "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>";
This works if matched case-insensitively. It's a bit of a mess, and it deliberately ignores the case where quotes aren't used. Here it is without the Java string escapes:
<img[^>]+src\s*=\s*['"]([^'"]+)['"][^>]*>
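To try the pattern against the three snippets from the question, something like the following sketch could be used. It assumes case-insensitive matching is requested through Pattern.CASE_INSENSITIVE; the class and method names (ImgSrcDemo, firstImgSrc) are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcDemo {
    // The imgRegex from this answer, with the case-insensitive flag set.
    private static final Pattern IMG = Pattern.compile(
            "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the first image source, or null if none matches.
    static String firstImgSrc(String html) {
        Matcher m = IMG.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<html><img src=\"kk.gif\" alt=\"text\"/></html>",
            "<html><img src='kk.gif' alt=\"text\"/></html>",
            "<html><img src = \"kk.gif\" alt=\"text\"/></html>"
        };
        for (String sample : samples) {
            System.out.println(firstImgSrc(sample));
        }
    }
}
```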
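The pattern matches:
- <img
- one or more characters that aren't > (i.e. possible other attributes)
- src
- optional whitespace
- =
- optional whitespace
- an opening delimiter of ' or "
- the image source (one or more characters that aren't ' or ")
- a closing delimiter of ' or "
- any number of characters that aren't > (more possible attributes)
- > to close the tag
Things to note:
- If you want to capture src= as well, move the open bracket further left :-)
- This doesn't cope with attribute values that contain > or image sources that include ' or ".
Upvotes: 27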
Reputation: 625077
This question comes up a lot here.
Regular expressions are a bad way of handling this problem. Do yourself a favour and use an HTML parser of some kind.
Regexes are flaky for parsing HTML. You'll end up with a complicated expression that behaves unexpectedly in corner cases, and those corner cases will turn up sooner or later.
Edit: If your HTML is that simple then:
Pattern p = Pattern.compile("src\\s*=\\s*([\"'])?([^ \"']*)");
Matcher m = p.matcher(str);
if (m.find()) {
    String src = m.group(2);
}
And there are any number of Java HTML parsers out there.
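As one illustration of the parser route, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html). This choice is an assumption made so the example needs no extra dependency; the answer does not name a specific parser, and a dedicated third-party library may cope better with modern HTML. The names ImgSrcParser and imgSources are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ImgSrcParser {
    // Hypothetical helper: returns the src value of every img tag in the input.
    static List<String> imgSources(String html) {
        final List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // img has no closing tag, so the parser reports it as a simple tag.
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }

    public static void main(String[] args) {
        System.out.println(imgSources("<html><img src = \"kk.gif\" alt=\"text\"/></html>"));
    }
}
```

The parser takes care of tag and attribute case, quoting style, and whitespace around =, which are exactly the variations the regex has to spell out by hand.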
Upvotes: 18
Reputation: 51311
Do you mean the src attribute of the img tag? In that case you can use the following (shown with Java string escapes):
<[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])
This works when src directly follows <img, i.e. when it is the first attribute. The expression src='...' is in parentheses, so it is a capturing group and can be processed separately.
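A small sketch of how that group could be read in Java follows. The names SrcAttributeGroup and srcAttribute are invented for illustration. Note that group 1 holds the whole attribute, including the src= prefix and the quotes, and that the pattern finds nothing when another attribute comes before src.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcAttributeGroup {
    // The pattern from this answer, unchanged; case is handled by the character classes.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])");

    // Hypothetical helper: returns the full src attribute text, or null if there is no match.
    static String srcAttribute(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(srcAttribute("<html><img src='kk.gif' alt=\"text\"/></html>"));
        System.out.println(srcAttribute("<html><img alt=\"text\" src=\"kk.gif\"/></html>"));
    }
}
```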
Upvotes: 0