AKS

Reputation: 4658

Strip specific HTML tags using Notepad++

Can anyone help me strip some of the HTML markup from a large XML file?

The XML file uses my own schema, and that part is fine. But I need to remove <span>, <style>, and <div> tags, as well as the attributes in <p> tags.

For example, I need to keep all <ul>, <ol>, <li>, <strong>, <a>, <img> and other tags, but remove <div> (with attributes), <span> (with attributes), and the attributes in <p> tags.

I have tried many examples from this site and many others, but most of them didn't work.

Upvotes: 2

Views: 7180

Answers (1)

Justin Morgan

Reputation: 30715

Quoting from an answer I posted yesterday:

I've heard some very good things about Beautiful Soup, HTML Purifier, and the HTML Agility Pack, which use Python, PHP, and .NET, respectively. Trust me--save yourself some pain and use those instead.

I strongly advise you not to use regex for this. No sane regex is going to work reliably, or probably even come close. However, a decent XML parser can do this fairly easily. I'm not sure what programming languages you have access to, but if you can use Python, PHP, .NET, or another language, you can use the parsers above to find each span, style, div, and p, and remove either the attributes or the entire tag.
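As a rough illustration of the parser-based approach, here is a minimal sketch in Python. To stay dependency-free it uses the standard library's html.parser instead of Beautiful Soup, so it is a stand-in for those libraries rather than a demonstration of them. The names TagStripper and strip_markup are made up for this example. It assumes an HTML fragment with lowercase tag names, and it assumes you want the contents of <style> dropped along with the tag; adjust the three sets if that isn't what you need.

```python
from html.parser import HTMLParser

UNWRAP = {"span", "div"}   # remove the tag itself, keep what is inside it
DROP = {"style"}           # remove the tag AND its contents (an assumption)
STRIP_ATTRS = {"p"}        # keep the tag, discard its attributes


class TagStripper(HTMLParser):
    """Hypothetical helper: re-emits markup while filtering selected tags."""

    def __init__(self):
        # convert_charrefs=False so entities like &amp; are passed through as-is
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP element

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in UNWRAP:
            return
        if tag in STRIP_ATTRS:
            self.out.append(f"<{tag}>")
        else:
            # Re-emit the original start tag text, attributes included
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... />
        if self.skip_depth or tag in UNWRAP or tag in DROP:
            return
        if tag in STRIP_ATTRS:
            self.out.append(f"<{tag}/>")
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in UNWRAP:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")


def strip_markup(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ('<div class="x"><p style="color:red">Hi <span id="s">there</span> '
          '<a href="u">link</a></p><style>p{}</style><ul><li>one</li></ul></div>')
print(strip_markup(sample))
# prints: <p>Hi there <a href="u">link</a></p><ul><li>one</li></ul>
```

Because html.parser lowercases tag names when it reports end tags, this sketch is only suitable for the HTML fragments inside your file. For the whole document with its custom schema, a real XML parser is the safer choice.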

jQuery has some good functionality for DOM-manipulation like you're describing, and you can use it to generate HTML which you then cut and paste.

If you absolutely must use regex, you could try this:

  • Pattern: <\s*/?\s*(span|style|div)\b[^>]*?>
  • Replacement: (nothing)

  • Pattern: <\s*p\b[^>]*?>
  • Replacement: <p>
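If you want to try the two patterns on a sample before running them over the whole file, here is a minimal sketch that applies them with Python's re module. The function name strip_with_regex is made up for this example, and the patterns are copied unchanged from the list above.

```python
import re

# First pattern: opening or closing span/style/div tags, with any attributes
REMOVE_TAGS = re.compile(r"<\s*/?\s*(span|style|div)\b[^>]*?>")
# Second pattern: opening <p ...> tags, with any attributes
BARE_P = re.compile(r"<\s*p\b[^>]*?>")


def strip_with_regex(text):
    """Hypothetical helper: apply both replacements in order."""
    text = REMOVE_TAGS.sub("", text)   # replacement: (nothing)
    return BARE_P.sub("<p>", text)     # replacement: <p>


sample = '<div class="x"><p style="c">Hi <span id="s">there</span></p></div>'
print(strip_with_regex(sample))
# prints: <p>Hi there</p>
```

Two limitations are worth knowing about. The first pattern removes only the tags, so the CSS text between <style> and </style> stays in the output. And because both patterns stop at the first >, an attribute value that contains a > character will be cut in the wrong place.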

Upvotes: 5
