I have a string that looks like:
4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE
I am trying to use a regex to search through the string, find all character codes, and replace each one with the character it represents. For my sample string, I want to end up with a string that looks like:
4000 BCE–5000 BCE and 600 CE–650 CE
I tried writing it in code, but I can't figure out what to write:
string line = "4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";
listof?datatype matches = search through `line` and find all the matches to "&#.*?;"
foreach (?datatype match in matches){
int extractedNumber = Convert.ToInt32(Regex.(/*extract the number that is between the &# and the ;*/));
//convert the number to ascii character
string actualCharacter = (char) extractedNumber + "";
//replace character code in original line
line = Regex.Replace(line, match, actualCharacter);
}
My original string actually has some HTML in it and looks like:
4000 <small>BCE</small>&#8211;5000 <small>BCE</small> and 600 <small>CE</small>&#8211;650 <small>CE</small>
I used line = Regex.Replace(note, "<.*?>", string.Empty); to remove the <small> tags, but apparently, according to one of the most popular questions on SO, "RegEx match open tags except XHTML self-contained tags", you really should not use regex to remove HTML.
Upvotes: 0
Views: 4680
How about doing it with a delegate replacement?
edit: As a side note, this is a good regex to remove all tags and script blocks:
<(?:script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?\s*>[\S\s]*?</script\s*|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
C#:
string line = @"4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";
Regex RxCode = new Regex(@"&#([0-9]+);");
string lineNew = RxCode.Replace(
    line,
    delegate( Match match ) {
        // Convert the captured decimal code to the character it represents
        return "" + (char)Convert.ToInt32( match.Groups[1].Value );
    }
);
Console.WriteLine( lineNew );
Output:
4000 BCE-5000 BCE and 600 CE-650 CE
edit: If you expect the hex form as well, you can handle that too.
# @"&\#(?:([0-9]+)|x([0-9a-fA-F]+));"
&\#
(?:
     ( [0-9]+ )          # (1)
  |  x
     ( [0-9a-fA-F]+ )    # (2)
)
;
C#:
Regex RxCode = new Regex(@"&#(?:([0-9]+)|x([0-9a-fA-F]+));");
string lineNew = RxCode.Replace(
    line,
    delegate( Match match ) {
        // Group 1 holds a decimal code, group 2 a hexadecimal one
        return match.Groups[1].Success ?
            "" + (char)Convert.ToInt32( match.Groups[1].Value ) :
            "" + (char)Int32.Parse( match.Groups[2].Value, System.Globalization.NumberStyles.HexNumber );
    }
);
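One caveat with the (char) cast above: a char holds a single UTF-16 code unit, so codes above 65535 (emoji, for instance) cannot be represented that way. The following is only a sketch of a variation, not the code from this answer; the class name NumericEntityDecoder and the method name Decode are made up for illustration. It uses char.ConvertFromUtf32 and leaves references untouched when they do not name a valid Unicode character.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (name chosen for this example only).
public static class NumericEntityDecoder
{
    // Matches decimal (&#8211;) and hexadecimal (&#x2013;) character references.
    private static readonly Regex RxCode =
        new Regex(@"&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));", RegexOptions.Compiled);

    public static string Decode(string input)
    {
        return RxCode.Replace(input, match =>
        {
            bool isDecimal = match.Groups[1].Success;
            string digits = isDecimal ? match.Groups[1].Value : match.Groups[2].Value;
            NumberStyles style = isDecimal ? NumberStyles.None : NumberStyles.AllowHexSpecifier;

            int codePoint;
            bool parsed = int.TryParse(digits, style, CultureInfo.InvariantCulture, out codePoint);

            // Only decode valid Unicode scalar values: in range and not a surrogate.
            bool valid = parsed
                && codePoint >= 0 && codePoint <= 0x10FFFF
                && (codePoint < 0xD800 || codePoint > 0xDFFF);

            // ConvertFromUtf32 returns a surrogate pair for code points above U+FFFF.
            return valid ? char.ConvertFromUtf32(codePoint) : match.Value;
        });
    }
}
```

Calling NumericEntityDecoder.Decode(line) would then replace both forms in one pass, and an out-of-range reference simply stays in the text as written.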
Upvotes: 2
You do not need any regex to convert XML entity references to literal strings.
Here is a solution that assumes your input is valid XML.
Add a using System.Xml; directive and use this method:
public string XmlUnescape(string escaped)
{
    XmlDocument doc = new XmlDocument();
    XmlNode node = doc.CreateElement("root");
    node.InnerXml = escaped;   // the XML parser resolves the entity references
    return node.InnerText;     // text content only, with any tags dropped
}
Use it like this:
var output1 = XmlUnescape("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.");
Result:
4000 BCE–5000 BCE and 600 CE–650 CE.
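Because assigning InnerXml throws an XmlException when the input is not well-formed (a bare & or an unclosed tag, for example), it may be worth guarding the call. This is only a sketch; the names XmlText and TryUnescape are hypothetical and not part of any library.

```csharp
using System.Xml;

// Hypothetical wrapper (names chosen for this example only).
public static class XmlText
{
    // Returns true and the decoded text when the input parses as XML content;
    // returns false (and null text) when XmlDocument rejects it.
    public static bool TryUnescape(string escaped, out string text)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            XmlNode node = doc.CreateElement("root");
            node.InnerXml = escaped;
            text = node.InnerText;
            return true;
        }
        catch (XmlException)
        {
            text = null;
            return false;
        }
    }
}
```

A side effect worth noting: since InnerText returns only the text content, well-formed tags such as <small>...</small> are dropped along the way.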
In case you cannot use the XmlDocument with your strings since they contain invalid XML syntax, you can use the following method that uses HttpUtility.HtmlDecode to convert only the entities that are known HTML and XML entities:
// Requires: using System.Web; using System.Text.RegularExpressions;
public string RevertEntities(string test)
{
    // Better declared once as a static readonly field for performance
    Regex rxHttpEntity = new Regex(@"(&[#\w]+;)");
    string last_res = string.Empty; // the string as it was after the previous pass
    while (rxHttpEntity.IsMatch(test)) // while the input still contains something that looks like an entity
    {
        // Replace every occurrence of the first entity reference found with its literal value (&amp; => &)
        string entity = rxHttpEntity.Match(test).Value;
        test = test.Replace(entity, HttpUtility.HtmlDecode(entity.ToLower()));
        if (last_res == test) // check whether this pass changed the string
            break; // if not, stop (there are unsupported entities like &ourgreatcompany;)
        else
            last_res = test; // otherwise, keep checking for entities
    }
    return test;
}
Call it like this:
var output2 = RevertEntities("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.");
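For comparison, the decoder can also be applied to the whole string in one call, with no regex loop. The sketch below assumes System.Net.WebUtility is available (.NET Framework 4.0 and later, or .NET Core), which avoids the System.Web reference; the class name EntityDecoder is made up for illustration.

```csharp
using System.Net;

// Hypothetical wrapper (name chosen for this example only).
public static class EntityDecoder
{
    // Decodes numeric (&#8211;, &#x2013;) and known named (&amp;, &ndash;) references.
    // References the decoder does not recognize are left in the text as written.
    public static string Decode(string input)
    {
        return WebUtility.HtmlDecode(input);
    }
}
```

Unlike RevertEntities, this makes a single pass over the input, so text that is double-escaped (such as &amp;amp;) is unescaped only one level.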
Alternatively, install HtmlAgilityPack using Manage NuGet Packages for Solution and use this code to get all the text:
public string getCleanHtml(string html)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    // InnerText drops the tags; DeEntitize converts the entities to characters
    return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}
And then use
var txt = "4000 <small>BCE</small>&#8211;5000 <small>BCE</small> and 600 <small>CE</small>&#8211;650 <small>CE</small>";
var clean = getCleanHtml(txt);
Result:
doc.DocumentNode.InnerText.Substring(doc.DocumentNode.InnerText.IndexOf("\n")).Trim();
You can also use LINQ with HtmlAgilityPack, download pages (with var webGet = new HtmlAgilityPack.HtmlWeb(); var doc = webGet.Load(url);), and a lot more. Best of all, there will be no entities to handle manually.
Upvotes: 1