Krishna Kumar

Reputation: 8201

Regular expression to get an attribute from HTML tag

I am looking for a regular expression, for use in Java, that extracts the value of the src attribute (matched case-insensitively) from the following HTML snippets.

<html><img src="kk.gif" alt="text"/></html>
<html><img src='kk.gif' alt="text"/></html>
<html><img src = "kk.gif" alt="text"/></html>

Upvotes: 14

Views: 37548

Answers (4)

Shree Krishna

Reputation: 8562

This answer is for people arriving from Google, since it comes long after the question was asked.

Copying cletus's pattern as posted gave me an error. Modifying it and passing the string src\\s*=\\s*([\"'])?([^\"']*) to Pattern.compile worked for me.

Here is the full example

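    // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
    String htmlString = "<div class=\"current\"><img src=\"img/HomePageImages/Paris.jpg\"></div>"; // Sample HTML

    // Group 1 is the optional opening quote, group 2 is the attribute value
    String ptr = "src\\s*=\\s*([\"'])?([^\"']*)";
    Pattern p = Pattern.compile(ptr);
    Matcher m = p.matcher(htmlString);
    if (m.find()) {
        String src = m.group(2); // Result: img/HomePageImages/Paris.jpg
    }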

Upvotes: 1

DMI

Reputation: 7191

One possibility (if matched case-insensitively):

String imgRegex = "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>";

It's a bit of a mess, and it deliberately ignores the case where quotes aren't used. Here it is without the Java string escapes:

<img[^>]+src\s*=\s*['"]([^'"]+)['"][^>]*>

This matches:

  • <img
  • one or more characters that aren't > (i.e. possible other attributes)
  • src
  • optional whitespace
  • =
  • optional whitespace
  • starting delimiter of ' or "
  • image source (which may not include a single or double quote)
  • ending delimiter
  • although the expression can stop here, I then added:
    • zero or more characters that are not > (more possible attributes)
    • > to close the tag

Things to note:

  • If you want to include the src= as well, move the open bracket further left :-)
  • This does not care about delimiter balancing or attribute values without delimiters, and it can also choke on badly-formed attributes (such as attributes that include > or image sources that include ' or ").
  • Parsing HTML with regular expressions like this is non-trivial, and at best a quick hack that works in the majority of cases.
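As a rough sketch of how the pattern above could be applied, the following compiles it with Pattern.CASE_INSENSITIVE and collects group 1 from every match. The class name ImgSrcExtractor and the method extractAll are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class ImgSrcExtractor {
    // The pattern from the answer, compiled case-insensitively.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the src value (capture group 1) of every <img> tag the pattern matches.
    static List<String> extractAll(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }

    public static void main(String[] args) {
        // The three snippets from the question.
        String[] samples = {
            "<html><img src=\"kk.gif\" alt=\"text\"/></html>",
            "<html><img src='kk.gif' alt=\"text\"/></html>",
            "<html><img src = \"kk.gif\" alt=\"text\"/></html>"
        };
        for (String sample : samples) {
            System.out.println(extractAll(sample));
        }
    }
}
```

Each of the three snippets from the question should yield a single entry, kk.gif. An unquoted value such as src=kk.gif is not matched, in line with the limitations listed above.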

Upvotes: 27

cletus

Reputation: 625077

This question comes up a lot here.

Regular expressions are a bad way of handling this problem. Do yourself a favour and use an HTML parser of some kind.

Regexes are flaky for parsing HTML. You'll end up with a complicated expression that still behaves unexpectedly in corner cases, and those corner cases will turn up sooner or later.

Edit: If your HTML is that simple then:

Pattern p = Pattern.compile("src\\s*=\\s*([\"'])?([^ \"']*)");
Matcher m = p.matcher(str);
if (m.find()) {
  String src = m.group(2);
}

And there are any number of Java HTML parsers out there.

Upvotes: 18

Mnementh

Reputation: 51311

You mean the src attribute of the img tag? In that case you can use the following:

<[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])

That should work. The expression src='...' is in parentheses, so it is a capturing group and can be processed separately.
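A minimal sketch of reading that group in Java, assuming the pattern exactly as written above. The class name ImgSrcGroupDemo and the method srcAttribute are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class ImgSrcGroupDemo {
    // Case-insensitivity is spelled out with character classes, as in the answer.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])");

    // Returns the whole src=... attribute (group 1), or null if nothing matches.
    static String srcAttribute(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(srcAttribute("<html><img src = \"kk.gif\" alt=\"text\"/></html>"));
        System.out.println(srcAttribute("<html><img alt=\"text\" src=\"kk.gif\"/></html>"));
    }
}
```

Group 1 holds the attribute name together with its quoted value, not just the file name. Because the pattern expects src directly after img, the second call, where alt comes first, finds no match and returns null.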

Upvotes: 0
