Andrew

Reputation: 7768

Match and replace string in text using regular expressions

I have a large string and it might have the following:

<div id="Specs" class="plinks">
<div id="Specs" class="plinks2">
<div id="Specs" class="sdfsf">
<div id="Specs" class="ANY-OTHER_NAME">

How can I replace any of the tags above, wherever they occur in the string, with:

<div id="Specs" class="">

This is what I came up with, but it does not work:

        string source = "bunch of text";
        string regex = "<div id=\"Specs\" class=[\"']([^\"']*)[\"']>";
        string regexReplaceTo = "<div id=\"Specs\" class=\"\">";
        string output = Regex.Replace(source, regex, regexReplaceTo); 

Upvotes: 0

Views: 995

Answers (3)

jessehouwing

Reputation: 114631

If your input isn't XML compliant, which most HTML isn't, then you can use the HTML Agility Pack to parse the HTML and manipulate the contents. With the HTML Agility Pack, combined with LINQ or XPath, the order of your attributes no longer matters (which it does when you use a regex), and the overall solution becomes much more robust.

Using the HTML Agility Pack (project page, NuGet), this does the trick:

// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);

// Descendants(name) returns only the elements with that tag name.
var nodes = doc.DocumentNode.Descendants("div").Where(div => div.Id == "Specs");

foreach (var node in nodes)
{
    // Only touch divs that actually carry a class attribute.
    var classAttribute = node.Attributes["class"];
    if (classAttribute != null)
    {
        classAttribute.Value = string.Empty;
    }
}

var fixedText = doc.DocumentNode.OuterHtml;
// or doc.Save(stream);

Upvotes: 2

Erik Philips

Reputation: 54628

Looks like another case of http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html. What happens to the following valid tags with a Regex?

<div class="reversed" id="Specs">            
<div  id="Specs"  class="additionalSpaces" >     
<div id="Specs" class="additionalAttributes" style="" >
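As a quick check, here is a minimal sketch that runs the pattern from the question against the original form of the tag and the three variants above. The class name SpecsRegexCheck and the method ClearClass are hypothetical names introduced for illustration; only the pattern and the replacement string come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the question or any answer.
public static class SpecsRegexCheck
{
    // Pattern and replacement exactly as written in the question.
    public const string Pattern = "<div id=\"Specs\" class=[\"']([^\"']*)[\"']>";
    public const string Replacement = "<div id=\"Specs\" class=\"\">";

    // Applies the question's replacement to the whole input string.
    public static string ClearClass(string source)
    {
        return Regex.Replace(source, Pattern, Replacement);
    }

    public static void Main()
    {
        string[] tags =
        {
            "<div id=\"Specs\" class=\"plinks\">",                         // form from the question
            "<div class=\"reversed\" id=\"Specs\">",                       // attributes reversed
            "<div  id=\"Specs\"  class=\"additionalSpaces\" >",            // extra whitespace
            "<div id=\"Specs\" class=\"additionalAttributes\" style=\"\" >" // extra attribute
        };

        foreach (string tag in tags)
        {
            // IsMatch reports whether the pattern finds the tag at all.
            Console.WriteLine($"{Regex.IsMatch(tag, Pattern),-5} {tag}");
        }
    }
}
```

Only the first tag, the form shown in the question, reports a match. The three variants are not matched, so Regex.Replace returns them unchanged.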

I don't see how using LINQ to XML wouldn't work with any combination, provided the input is well-formed XML:

// Requires: using System.Linq; using System.Xml.Linq;
XElement root = XElement.Parse(xml); // or XDocument.Load(xmlFile).Root
var specsDivs = root.Descendants("div")
                    // The (string) cast yields null when the attribute is missing.
                    .Where(e => (string)e.Attribute("id") == "Specs"
                             && e.Attribute("class") != null);
foreach (var div in specsDivs)
{
    div.Attribute("class").Value = string.Empty;
}
string newXml = root.ToString();

Upvotes: 4

Dr.Kameleon

Reputation: 22810

What about...

  • Regex to match : class=\"[A-Za-z0-9_\-]+\"
  • Replace with : class=\"\"

This way, we ignore the first part (id="Specs", etc.) and just replace the class name with nothing.
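A minimal C# sketch of this suggestion, assuming the pattern is applied with Regex.Replace as in the question. The names ClassOnlyRegexSketch and ClearClasses are hypothetical. Because the pattern does not look at the id, it clears the class attribute of every element in the input, not only the Specs div. It also skips values that are single-quoted or that contain spaces, such as multiple class names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class ClassOnlyRegexSketch
{
    // Pattern from the answer as a C# verbatim string ("" is a literal quote).
    public const string Pattern = @"class=""[A-Za-z0-9_\-]+""";

    // Replaces every matching class attribute with an empty one.
    public static string ClearClasses(string source)
    {
        return Regex.Replace(source, Pattern, "class=\"\"");
    }

    public static void Main()
    {
        string html = "<div id=\"Specs\" class=\"ANY-OTHER_NAME\"><p class=\"note\">text</p></div>";

        // Both the div and the nested p lose their class value.
        Console.WriteLine(ClearClasses(html));
    }
}
```

If only the Specs div should change, the id has to be part of the match, or the markup has to be parsed as in the other answers.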

Upvotes: 4
