Reputation: 67
Hi, first, sorry for my English.
I need to remove one specific HTML tag, not all tags.
This is the tag I want to remove:
xxx
<object data="/dictionary/flash/SpeakerApp16.swf" type="application/x-shockwave-flash" width=" 16" height="16" id="pronunciation"> <param name="movie" value="/dictionary/flash/SpeakerApp16.swf"><param name="flashvars" value="sound_name=http%3A%2F%2Fwww.gstatic.com%2Fdictionary%2Fstatic%2Fsounds%2Fde%2F0%2Fman.mp3"><param name="wmode" value="transparent"><a href="http://www.gstatic.com/dictionary/static/sounds/de/0/man.mp3"><img border="0" width="16" height="16" src="/dictionary/flash/SpeakerOffA16.png" alt="listen"></a> </object>
yyy
I want the result to be: xxx yyy
Upvotes: 1
Views: 821
Reputation: 39338
Why use regex when you can simply use IndexOf?
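string html = "...";
int start;
// Remove every <object ...>...</object> block, one at a time.
while ((start = html.IndexOf("<object", StringComparison.Ordinal)) >= 0)
{
    int end = html.IndexOf("</object>", start, StringComparison.Ordinal);
    if (end < 0) break; // unclosed tag: stop instead of passing a bad count to Remove
    html = html.Remove(start, end - start + "</object>".Length);
}
// now 'html' contains the html without object tags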
Explanation:
<object
Upvotes: 1
Reputation: 447
Although others are right that this would be easier using DOM methods, if you can't manipulate the DOM and your HTML is effectively just a string, then you can do this (assuming C#):
string resultString = null;
try {
resultString = Regex.Replace(subjectString,
@"\s+<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*", " ", RegexOptions.IgnoreCase);
} catch (ArgumentException ex) {
// Error catching
}
This assumes that <object is the only part of this that might not change and that the tag is always closed with </object>.
EDIT: Explanation: The regex searches for any white space, then for <object, then it looks for anything that is not a closing angle bracket, followed by the closing angle bracket of object. Then it searches for anything that is not an open-angle bracket, or anything that is an open-angle bracket not followed by /object (referred to via backreference \1), as many times as possible, followed by </object> (using backreference \1 again), and finally any white space. It then replaces what has been matched with a single space.
EDIT2: For efficiency, I used \s+ at the beginning of the regex, which means it will only match if there is at least one whitespace character (which can include newline) before <object. However, if your original HTML could have, say, xxx<object (e.g., the HTML string is minified), then change \s+ to \s*. Whether \s+ or \s* is more efficient depends on how optimized the C# regex engine is in the version/system/OS you're targeting. So experiment to find out which matches faster.
EDIT3: The regex can be further simplified to this: \s+<(object)\b(?:[^<]|<(?!/\1))*</\1>\s*
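To make the explanation above concrete, here is a minimal sketch that runs the pattern against a shortened version of the markup from the question. The class name ObjectTagDemo, the method name StripObjectTags, and the trimmed-down input string are only for illustration; the sketch uses the \s* variant so that minified input such as xxx<object also matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ObjectTagDemo
{
    // Pattern from the answer, with \s* in place of \s+ (see EDIT2).
    // \1 refers back to the captured tag name "object".
    private const string Pattern =
        @"\s*<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*";

    // Hypothetical helper: replaces each <object>...</object> block,
    // together with the whitespace around it, with a single space.
    public static string StripObjectTags(string html)
    {
        return Regex.Replace(html, Pattern, " ", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Shortened stand-in for the markup in the question.
        string input =
            "xxx\n" +
            "<object data=\"a.swf\" id=\"pronunciation\"> " +
            "<param name=\"movie\" value=\"a.swf\">" +
            "<a href=\"#\"><img alt=\"listen\"></a> </object>\n" +
            "yyy";

        Console.WriteLine(StripObjectTags(input)); // prints "xxx yyy"
    }
}
```

Note that an object block at the very start or end of the string leaves a leading or trailing space behind, so you may want to call Trim() on the result.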
Upvotes: 1
Reputation: 48337
If you know exactly what the tag will be, a non-regex search and replace will be faster and more efficient. How much do you know of the tag's form?
Also, regex & HTML is a Bad Thing.
Upvotes: 1