Reputation: 20068
I have the following regular expression which is used to give me the tags in the HTML string:
<[^>]*>
So, if I pass in the following:
<b> Bold </b>
Then it will give me:
<b>
</b>
How can I make it give me:
<b>
Bold
</b>
UPDATE:
Here is another example to get the big picture:
If this is the text:
<b>Bold</b> This is the stuff <i>Italic</i>
then the final result would be the following:
matches[0] = <b>
matches[1] = Bold
matches[2] = </b>
matches[3] = This is the stuff
matches[4] = <i>
matches[5] = Italic
matches[6] = </i>
Upvotes: 0
Views: 526
Reputation: 21470
I second the advice not to use regular expressions; HTML can't be properly described by a regular language.
Better to investigate System.Xml.XmlReader and System.Web.UI.HtmlTextWriter. You should be able to write a function that reads each node from a reader and then writes it to a writer; something along the lines of
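// Requires: using System; using System.IO; using System.Web.UI; using System.Xml;
// HtmlTextWriter lives in System.Web, so this targets the .NET Framework.
public static string HtmlReformat(string html)
{
    var sw = new StringWriter();
    var htmlWriter = new HtmlTextWriter(sw);
    // Fragment conformance lets the reader accept several top-level nodes,
    // e.g. "<b>Bold</b> text <i>Italic</i>", instead of requiring one root element.
    var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
    using (XmlReader rdr = XmlReader.Create(new StringReader(html), settings))
    {
        while (rdr.Read())
        {
            switch (rdr.NodeType)
            {
                case XmlNodeType.EndElement:
                    htmlWriter.WriteEndTag(rdr.Name);
                    htmlWriter.Write(Environment.NewLine);
                    break;
                case XmlNodeType.Element:
                    htmlWriter.WriteBeginTag(rdr.Name);
                    // Move onto each attribute so that Name and Value refer to the attribute.
                    for (int attributeIdx = 0; attributeIdx < rdr.AttributeCount; attributeIdx++)
                    {
                        rdr.MoveToAttribute(attributeIdx);
                        htmlWriter.WriteAttribute(rdr.Name, rdr.Value);
                    }
                    rdr.MoveToElement(); // step back from the last attribute to the element
                    htmlWriter.Write(">");
                    htmlWriter.Write(Environment.NewLine);
                    break;
                case XmlNodeType.Text:
                    htmlWriter.Write(rdr.Value);
                    break;
                default:
                    // Whitespace, comments, CDATA etc. still need handling.
                    throw new NotImplementedException("Handle " + rdr.NodeType);
            }
        }
    }
    return sw.ToString();
}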
This should give you a base to work from, anyway.
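If the goal is the matches[] list from the question rather than reformatted markup, the same reader loop can collect tokens instead of writing them. The following is only a sketch under assumptions: the class name MarkupTokens and the method Split are made up for illustration, the input must be well-formed XHTML, and attributes are dropped for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class MarkupTokens
{
    // Hypothetical helper: returns start tags, text runs and end tags in document order.
    public static List<string> Split(string xhtml)
    {
        var tokens = new List<string>();
        // Fragment conformance allows several top-level nodes and top-level text.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Attributes are omitted here; self-closing elements produce no end tag.
                        tokens.Add("<" + reader.Name + ">");
                        break;
                    case XmlNodeType.EndElement:
                        tokens.Add("</" + reader.Name + ">");
                        break;
                    case XmlNodeType.Text:
                        tokens.Add(reader.Value.Trim());
                        break;
                    // Other node types (whitespace, comments, ...) are skipped in this sketch.
                }
            }
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (string token in Split("<b>Bold</b> This is the stuff <i>Italic</i>"))
            Console.WriteLine(token);
    }
}
```

For the question's second example this is intended to yield the seven entries listed there, one per line. Input that is not well-formed XML (unclosed tags, unquoted attributes) makes XmlReader throw an XmlException, so real-world HTML would need a more forgiving parser.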
Upvotes: 1
Reputation: 19620
If the input is XHTML, then it's also legal XML, so you can do all this with some simple XSLT.
Upvotes: 2
Reputation: 36397
HTML tags are some of the biggest pains for regex. You have to be careful, because simply matching the first and last tag won't be enough if there is more than one tag on the same line or, depending on how you evaluate it, anywhere in the string you're evaluating.
Here is a decent expression you can use...
@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>"
You will have named groups tag and text that you can use to access the matched values. With those values you can format your output. Depending on your language, you may need to specify that you want to search the entire string as a single line (in .NET, RegexOptions.Singleline).
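As a rough illustration of how the named groups could be read in C#, here is a small sketch. The class name TagRegexDemo and the method SplitFirstElement are invented for the example; the pattern itself is the one given above, unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagRegexDemo
{
    // The expression from this answer; Singleline lets '.' also match newlines.
    private static readonly Regex TagPattern =
        new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>", RegexOptions.Singleline);

    // Hypothetical helper: returns { open tag, text, close tag } for the first match,
    // or an empty array when nothing matches.
    public static string[] SplitFirstElement(string input)
    {
        Match m = TagPattern.Match(input);
        if (!m.Success)
            return new string[0];

        string tag = m.Groups["tag"].Value;
        return new[]
        {
            "<" + tag + ">",
            m.Groups["text"].Value.Trim(),
            "</" + tag + ">"
        };
    }

    public static void Main()
    {
        foreach (string part in SplitFirstElement("<b> Bold </b>"))
            Console.WriteLine(part);
    }
}
```

Note the limits of this approach: the pattern only sees text that sits between a matching pair of tags, so in "<b>Bold</b> This is the stuff <i>Italic</i>" the run between the two elements is never captured, and tags with attributes do not match at all.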
Upvotes: -2
Reputation: 7831
If your regular expression engine supports backreferences, you can use <(.*?)>.*?</\1>. This works in Perl.
Upvotes: -1
Reputation: 60398
Do not use regular expressions to parse HTML. HTML is not regular, and therefore regex is not at all suited to parsing it. Use an HTML or XML parser instead. There are many (HT|X)ML parsers available online. What language are you using?
You're not going to be able to create a regular expression that matches HTML because of the complexity of the language. Regex operates on a class of languages smaller than the class HTML is a member of. Any regex you try to write will be hard to understand and incorrect.
Use something like XPath instead.
EDIT: You're using C#. Luckily you have an entire System.Xml namespace available to you. Also, there are other libraries for parsing HTML specifically if your HTML is not strict.
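To make the XPath suggestion concrete, here is a minimal sketch using XmlDocument from System.Xml. It assumes the markup is well-formed XHTML; the wrapper element name root, the class XPathDemo and the method TextRuns are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathDemo
{
    // Hypothetical helper: returns the trimmed content of every text node in the fragment.
    public static List<string> TextRuns(string xhtmlFragment)
    {
        // Wrapping in an artificial <root> element gives the fragment the
        // single root element that XmlDocument requires.
        var doc = new XmlDocument();
        doc.LoadXml("<root>" + xhtmlFragment + "</root>");

        var runs = new List<string>();
        // XPath "//text()" selects text nodes at any depth.
        foreach (XmlNode node in doc.SelectNodes("//text()"))
            runs.Add(node.Value.Trim());
        return runs;
    }

    public static void Main()
    {
        foreach (string run in TextRuns("<b>Bold</b> This is the stuff <i>Italic</i>"))
            Console.WriteLine(run);

        // A more targeted query: the content of the <i> element only.
        var doc = new XmlDocument();
        doc.LoadXml("<root><b>Bold</b> This is the stuff <i>Italic</i></root>");
        Console.WriteLine(doc.SelectSingleNode("/root/i").InnerText);
    }
}
```

LoadXml throws an XmlException on markup that is not well-formed, which is where a dedicated HTML parsing library would be the better fit.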
Upvotes: 11