Reputation: 67

Regular expressions to remove a specific HTML tag

Hi. First, sorry for my English.

I need to remove one specific HTML tag, not all tags.

This is the tag I want to remove:

xxx

<object data="/dictionary/flash/SpeakerApp16.swf" type="application/x-shockwave-flash" width=" 16" height="16" id="pronunciation"> <param name="movie" value="/dictionary/flash/SpeakerApp16.swf"><param name="flashvars" value="sound_name=http%3A%2F%2Fwww.gstatic.com%2Fdictionary%2Fstatic%2Fsounds%2Fde%2F0%2Fman.mp3"><param name="wmode" value="transparent"><a href="http://www.gstatic.com/dictionary/static/sounds/de/0/man.mp3"><img border="0" width="16" height="16" src="/dictionary/flash/SpeakerOffA16.png" alt="listen"></a> </object>

yyy

I want the result to be: xxx yyy

Upvotes: 1

Answers (3)

Hans Kesting

Reputation: 39338

Why use regex when you can simply use IndexOf?

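string html = "...";
const string closeTag = "</object>";
int start;
// Ordinal, case-insensitive search so <OBJECT> is found as well
while ((start = html.IndexOf("<object", StringComparison.OrdinalIgnoreCase)) >= 0)
{
    int end = html.IndexOf(closeTag, start, StringComparison.OrdinalIgnoreCase);
    if (end < 0)
        break; // unclosed <object>: stop instead of passing a bad length to Remove
    html = html.Remove(start, end - start + closeTag.Length);
}
// now 'html' contains the html without object tags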

Explanation:

Find the first occurrence of <object
Find the start of the next closing tag
Remove that part including the whole closing tag
Repeat until no object tags are left
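The steps above can be wrapped in a small reusable method. This is only a sketch: the class name TagStripper and the method RemoveElement are made up for illustration and are not part of the original answer. It assumes elements of the same name are never nested, and it will also match longer tag names that merely start with the given name.

```csharp
using System;

public static class TagStripper
{
    // Hypothetical helper: removes every <tagName ...>...</tagName> span from html.
    // Assumes no nesting of the same element and well-formed closing tags.
    public static string RemoveElement(string html, string tagName)
    {
        string open = "<" + tagName;
        string close = "</" + tagName + ">";
        int start;
        while ((start = html.IndexOf(open, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            int end = html.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);
            if (end < 0)
                break; // no closing tag: leave the remainder untouched
            html = html.Remove(start, end - start + close.Length);
        }
        return html;
    }

    public static void Main()
    {
        // Shortened stand-in for the markup in the question.
        string html = "xxx <object data=\"a.swf\"><param name=\"movie\" value=\"a.swf\"></object> yyy";
        Console.WriteLine(RemoveElement(html, "object"));
    }
}
```

Whitespace around the removed element is left as it was, so the sample keeps both of the spaces that surrounded the tag. Collapse them afterwards if a single space is wanted.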

Upvotes: 1

Jaifroid

Reputation: 447

Although others are right that this would be easier using DOM methods, if you can't manipulate the DOM and your HTML is effectively just a string, then you can do this (assuming C#):

string resultString = null;
try {
    resultString = Regex.Replace(subjectString, 
        @"\s+<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*", " ", RegexOptions.IgnoreCase);
} catch (ArgumentException ex) {
    // Error catching
}

This assumes that <object is the only part of this that might not change and that the tag is always closed with </object>.

EDIT: Explanation: The regex matches one or more whitespace characters, then <object, then any characters that are not a closing angle bracket, up to the > that ends the opening tag. Next it matches, as many times as possible, either a character that is not an opening angle bracket or an opening angle bracket that is not followed by /object (referred to via backreference \1). Finally it matches </object> (using backreference \1 again) and any trailing whitespace. Whatever was matched is replaced with a single space.

EDIT2: For efficiency, I used \s+ at the beginning of the regex, which means it will only match if there is at least one whitespace character (which can include a newline) before <object. However, if your original HTML could contain, say, xxx<object (e.g., the HTML string is minified), then change \s+ to \s*. Whether \s+ or \s* is more efficient depends on how optimized the C# regex engine is in the version/system/OS you're targeting, so experiment to find out which matches faster.

EDIT3: The regex can be further simplified to this: \s+<(object)\b(?:[^<]|<(?!/\1))*</\1>\s*.
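Here is a minimal, self-contained sketch of the pattern applied to markup shaped like the question's. The class and method names (RegexDemo, StripObject) are invented for the example, the sample string is a shortened stand-in for the real tag, and the \s* variant from EDIT2 is used so that minified input would match too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexDemo
{
    // Same pattern as above, with \s* instead of \s+ at the start.
    private static readonly Regex ObjectElement = new Regex(
        @"\s*<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*",
        RegexOptions.IgnoreCase);

    // Replaces each <object>...</object> element, plus surrounding whitespace, with one space.
    public static string StripObject(string html)
    {
        return ObjectElement.Replace(html, " ");
    }

    public static void Main()
    {
        string html = "xxx\n\n<object data=\"a.swf\"><param name=\"movie\" value=\"a.swf\">"
                    + "<a href=\"#\"><img alt=\"listen\"></a> </object>\n\nyyy";
        Console.WriteLine(StripObject(html)); // prints "xxx yyy"
    }
}
```

Building the Regex once in a static field avoids re-parsing the pattern on every call, which is worth doing if many strings are processed.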

Upvotes: 1

ssube

Reputation: 48337

If you know exactly what the tag will be, a non-regex search and replace will be faster and more efficient. How much do you know of the tag's form?

Also, regex & HTML is a Bad Thing.
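For the case where the tag's exact text really is known in advance, a plain string replacement might look like the sketch below. The names ExactTagDemo and Strip are hypothetical, and knownTag is a shortened stand-in for the real markup. The approach only works if the markup matches character for character.

```csharp
using System;

public static class ExactTagDemo
{
    // Removes every exact (ordinal, case-sensitive) occurrence of knownTag.
    public static string Strip(string html, string knownTag)
    {
        return html.Replace(knownTag, string.Empty);
    }

    public static void Main()
    {
        // Assumption: the element's markup is fixed and known ahead of time.
        string knownTag = "<object data=\"a.swf\"><param name=\"movie\" value=\"a.swf\"></object>";
        string html = "xxx " + knownTag + " yyy";
        Console.WriteLine(Strip(html, knownTag));
    }
}
```

If any attribute value varies between pages (the sound URL in the question, for example), the exact match fails and one of the other approaches is needed.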

Upvotes: 1
