Tot Zam

Reputation: 8746

Regex replacing all ASCII character codes with actual characters

I have a string that looks like:

4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.

I am trying to use a regex to search through the string, find all character codes, and replace them with the corresponding actual characters. For my sample string, I want to end up with a string that looks like:

4000 BCE–5000 BCE and 600 CE–650 CE.

I tried writing it in code, but I can't figure out what to write:

string line = "4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";

listof?datatype matches = search through `line` and find all the matches to  "&#.*?;"

foreach (?datatype match in matches){
    int extractedNumber = Convert.ToInt32(Regex.(/*extract the number that is between the &# and the ?*/));

    //convert the number to ascii character
    string actualCharacter = (char) extractedNumber + "";

    //replace character code in original line
    line = Regex.Replace(line, match, actualCharacter); 
}

Edit

My original string actually has some HTML in it and looks like:

4000 <small>BCE</small>&#8211;5000 <small>BCE</small> and 600 <small>CE</small>&#8211;650 <small>CE</small>

I used line = Regex.Replace(note, "<.*?>", string.Empty); to remove the <small> tags, but according to one of the most popular questions on SO, "RegEx match open tags except XHTML self-contained tags", you really should not use regex to remove HTML.

Upvotes: 0

Views: 4680

Answers (2)

user557597

How about doing it with a delegate replacement?
edit: As a side note, this is a good regex to remove all tags and script blocks:

<(?:script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?\s*>[\S\s]*?</script\s*|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

C#:

string line = @"4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";
Regex RxCode = new Regex(@"&#([0-9]+);");
string lineNew = RxCode.Replace(
    line,
    delegate( Match match ) {
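        // ConvertFromUtf32 also handles code points above U+FFFF, which a plain (char) cast would truncate
        return char.ConvertFromUtf32( Convert.ToInt32( match.Groups[1].Value ) );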
    }
);
Console.WriteLine( lineNew );

Output:

4000 BCE-5000 BCE and 600 CE-650 CE

edit: If you expect the hex form as well, you can handle that too.

 #  @"&\#(?:([0-9]+)|x([0-9a-fA-F]+));"

 &\#
 (?:
      ( [0-9]+ )                    # (1)
   |  x
      ( [0-9a-fA-F]+ )              # (2)
 )
 ;

C#:

Regex RxCode = new Regex(@"&#(?:([0-9]+)|x([0-9a-fA-F]+));");
string lineNew = RxCode.Replace(
    line,
    delegate( Match match ) {
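        // Group 1 holds decimal digits, group 2 holds hex digits.
        // ConvertFromUtf32 also handles code points above U+FFFF.
        return char.ConvertFromUtf32( match.Groups[1].Success ?
            Convert.ToInt32( match.Groups[1].Value ) :
            Int32.Parse( match.Groups[2].Value, System.Globalization.NumberStyles.HexNumber ) );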
    }
);
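
For reference, here is a self-contained sketch of the same decimal-plus-hex idea. It is not the answer's exact code: the class name NumericEntityDecoder and the method name Decode are made up for illustration, and the choice to leave out-of-range references untouched (rather than throw) is an assumption about what is wanted.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class NumericEntityDecoder
{
    // Group 1: decimal digits. Group 2: hex digits after 'x' or 'X'.
    private static readonly Regex RxCode =
        new Regex(@"&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));", RegexOptions.Compiled);

    public static string Decode(string input)
    {
        return RxCode.Replace(input, match =>
        {
            int codePoint = 0;
            bool parsed = match.Groups[1].Success
                ? int.TryParse(match.Groups[1].Value, NumberStyles.None,
                               CultureInfo.InvariantCulture, out codePoint)
                : int.TryParse(match.Groups[2].Value, NumberStyles.AllowHexSpecifier,
                               CultureInfo.InvariantCulture, out codePoint);

            // Assumption: references that are not valid Unicode scalar values
            // (overflow, surrogates, above U+10FFFF) are left as they are.
            bool valid = parsed && codePoint >= 0 && codePoint <= 0x10FFFF
                         && (codePoint < 0xD800 || codePoint > 0xDFFF);
            return valid ? char.ConvertFromUtf32(codePoint) : match.Value;
        });
    }

    public static void Main()
    {
        // Both the decimal and the hex reference become U+2013 (en dash).
        Console.WriteLine(Decode("600 CE&#8211;650 CE and 600 CE&#x2013;650 CE"));
    }
}
```

Using char.ConvertFromUtf32 instead of a (char) cast means a reference such as &#128512; comes back as a surrogate pair rather than a truncated character.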

Upvotes: 2

Wiktor Stribiżew

Reputation: 626927

You do not need any regex to convert XML entity references to literal strings.

Solution 1: XML-valid input

Here is a solution that assumes you have an XML-valid input.

Add a using System.Xml; directive and use this method:

public string XmlUnescape(string escaped)
{
    XmlDocument doc = new XmlDocument();
    XmlNode node = doc.CreateElement("root");
    node.InnerXml = escaped;
    return node.InnerText;
}

Use it like this:

var output1 = XmlUnescape("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.");

Result:

4000 BCE–5000 BCE and 600 CE–650 CE.
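
Because this approach parses the input as an XML fragment, setting InnerXml throws an XmlException when the string is not well-formed (a bare & is enough). A minimal sketch of a non-throwing wrapper follows; the names XmlEntityHelper and TryXmlUnescape are hypothetical and not part of the answer above.

```csharp
using System;
using System.Xml;

// Hypothetical wrapper, named for illustration only.
public static class XmlEntityHelper
{
    // Returns false (and hands back the input unchanged) instead of throwing
    // when the fragment is not well-formed XML.
    public static bool TryXmlUnescape(string escaped, out string result)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            XmlNode node = doc.CreateElement("root");
            node.InnerXml = escaped;  // parses the fragment; throws XmlException if malformed
            result = node.InnerText;  // text content with references resolved and tags dropped
            return true;
        }
        catch (XmlException)
        {
            result = escaped;
            return false;
        }
    }
}
```

Since InnerText drops element tags, the same call also strips well-formed markup such as the <small> elements in the question's original string. A false return is the cue to fall back to one of the other solutions.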

Solution 2: Non-valid XML input with HTML/XML entities

In case you cannot use the XmlDocument with your strings since they contain invalid XML syntax, you can use the following method that uses HttpUtility.HtmlDecode to convert only the entities that are known HTML and XML entities:

public string RevertEntities(string test)
{
    // Declare a regex (better to initialize it once as a static field for performance)
    Regex rxHttpEntity = new Regex(@"(&[#\w]+;)");
    string last_res = string.Empty; // the string as it was after the previous pass
    while (rxHttpEntity.IsMatch(test)) // while the input has something like &#101; or &nbsp;
    {
        // Replace the first entity reference found with its literal value (&amp; => &)
        string entity = rxHttpEntity.Match(test).Value;
        test = test.Replace(entity, HttpUtility.HtmlDecode(entity.ToLower()));
        if (last_res == test) // Check whether this pass changed the string
            break; // If not, stop processing (there are some unsupported entities like &ourgreatcompany;)
        else
            last_res = test; // Else, go on checking for entities
    }
    return test;
}

Calling this as below:

var output2 = RevertEntities("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE."); 
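
As an alternative sketch, not part of the original answer: if a dependency on System.Web is not wanted, System.Net.WebUtility.HtmlDecode can decode the whole string in one call, with no loop. The class name HtmlDecodeDemo and its Decode wrapper are hypothetical.

```csharp
using System;
using System.Net;

// Hypothetical demo class, named for illustration only.
public static class HtmlDecodeDemo
{
    // Decodes every numeric and named HTML entity it recognizes in one pass.
    public static string Decode(string input)
    {
        return WebUtility.HtmlDecode(input);
    }

    public static void Main()
    {
        // Each &#8211; is replaced by U+2013 (en dash).
        Console.WriteLine(Decode("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE."));
    }
}
```

This does not remove HTML tags, so it addresses only the entity part of the problem.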

Solution 3: HtmlAgilityPack and HtmlEntity.DeEntitize

Install the HtmlAgilityPack package (via Manage NuGet Packages for Solution) and use this code to get all the text:

public string getCleanHtml(string html)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}

And then use

var txt = "4000 <small>BCE</small>&#8211;5000 <small>BCE</small> and 600 <small>CE</small>&#8211;650 <small>CE</small>";
var clean = getCleanHtml(txt);

Result:

4000 BCE–5000 BCE and 600 CE–650 CE

doc.DocumentNode.InnerText.Substring(doc.DocumentNode.InnerText.IndexOf("\n")).Trim();

You can also use LINQ with HtmlAgilityPack, download pages (with var webGet = new HtmlAgilityPack.HtmlWeb(); var doc = webGet.Load(url);), and a lot more. Best of all, there are no entities to handle manually.

Upvotes: 1
