khernik

Reputation: 2091

Extracting content of HTML tag with specific attribute

Using regular expressions, I need to extract the (possibly multiline) content of a tag that has a specific id value. How can I do this?

This is what I currently have:

<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>

The problem with this is this sample:

<div id="1">test</div><div id="2">test</div>

If I want to replace the div with id="2" using this regexp (with ${value} = 2), the whole string gets matched. This is because the first (.|\n)*? is allowed to run past the > that closes the first opening tag, so the match starts at the first <div and consumes everything up to the id, which is wrong.
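To make the failure concrete, here is a small sketch. It assumes JavaScript, since ${value} looks like a template literal placeholder, although the question does not name a language; the variable names are only illustrative.

```javascript
// Assumption: JavaScript. The names below are illustrative only.
const value = "2";
const sample = '<div id="1">test</div><div id="2">test</div>';

// The pattern from the question: (.|\n)*? may cross the ">" of the first tag.
const pattern = new RegExp(
  `<div(.|\\n)*?id="${value}"(.|\\n)*?>(.|\\n)*?<\\/div>`
);

// The match begins at the very first "<div", not at the div with id="2".
const wholeMatch = sample.match(pattern)[0];
console.log(wholeMatch === sample); // true
```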

How can I do this?

Upvotes: 0

Views: 336

Answers (2)

user557597

A fairly simple way is to use

Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>

Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/

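Substitute your variable for the literal 2.

The (?=\s) requires whitespace right after div, and [^>] keeps the attribute search inside a single opening tag. The content will be in group 1.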
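As a rough sketch of how that could be wired up, assuming JavaScript (the thread does not say which language is in use): the helper names extractDivContentById and escapeRegExp are made up for this example, and the escaping step is an addition so that ids containing regex metacharacters are treated literally.

```javascript
// Hypothetical helper: escape regex metacharacters in the id value.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the content of the first <div> whose id
// attribute equals `id`, or null when no such div is found.
function extractDivContentById(html, id) {
  const pattern = new RegExp(
    `<div(?=\\s)[^>]*?\\sid="${escapeRegExp(id)}"[^>]*?>([\\S\\s]*?)</div>`
  );
  const match = pattern.exec(html);
  return match ? match[1] : null; // group 1 holds the tag content
}

const html = '<div id="1">test</div><div id="2">line one\nline two</div>';
console.log(extractDivContentById(html, "2"));
```

Because ([\S\s]*?) is lazy, the match ends at the first </div> it meets, so a div nested inside the target will cut the captured content short.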

Upvotes: 1

Barmar

Reputation: 782498

Inside the opening tag, change (.|\n) to [^>] so it can't match the > that ends the tag. The attribute search then can't run across different divs.

<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>

Also, instead of using (.|\n)* to match across multiple lines, use the s (dotall) modifier on the regexp. This makes . match any character, including newlines, which the .*? above relies on for multiline content.
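A quick sketch of the difference the s modifier makes, assuming JavaScript (where the flag is passed as the second argument to RegExp); the sample string and variable names are made up for illustration, and value is assumed to contain no regex metacharacters.

```javascript
// Assumption: JavaScript with support for the "s" (dotAll) flag.
const value = "2";
const html = '<div id="1">test</div><div id="2">a\nb</div>';

// Same pattern as above, written as a string for the RegExp constructor.
const source = `<div\\b[^>]*\\bid="${value}"[^>]*>.*?</div>`;

// Without "s", "." stops at the newline inside the second div.
const withoutS = new RegExp(source).exec(html);
// With "s", "." also matches the newline.
const withS = new RegExp(source, "s").exec(html);

console.log(withoutS);  // null
console.log(withS[0]);  // only the div with id="2"
```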

However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.
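For comparison, a minimal sketch of the DOM route, assuming JavaScript running in a browser where DOMParser is available (plain Node.js does not provide it); innerHtmlById is a hypothetical helper name.

```javascript
// Hypothetical helper: return the inner HTML of the element with the
// given id in an already parsed document, or null if there is none.
function innerHtmlById(doc, id) {
  const element = doc.getElementById(id);
  return element ? element.innerHTML : null;
}

// Browser-only usage; skipped where DOMParser does not exist.
if (typeof DOMParser !== "undefined") {
  const doc = new DOMParser().parseFromString(
    '<div id="1">test</div><div id="2">test</div>',
    "text/html"
  );
  console.log(innerHtmlById(doc, "2"));
}
```

Unlike the regexps, this copes with nested divs, attribute order, and different quoting styles, because the parser does the tag matching.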

Upvotes: 0
