Regex replacing all ASCII character codes with actual characters

Question

I have a string that looks like:

4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.

I am trying to use a regex to search through the string, find every character code, and replace it with the corresponding actual character. For my sample string, I want to end up with a string that looks like:

4000 BCE–5000 BCE and 600 CE–650 CE.

I tried writing it in code, but I can't figure out what to write:

string line = "4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";

listof?datatype matches = search through `line` and find all the matches to  "&#.*?;"

foreach (?datatype match in matches){
    int extractedNumber = Convert.ToInt32(Regex.(/*extract the number that is between the &# and the ;*/));

    //convert the number to ascii character
    string actualCharacter = (char) extractedNumber + "";

    //replace character code in original line
    line = Regex.Replace(line, match, actualCharacter); 
}

Edit

My original string actually has some HTML in it and looks like:

4000 BCE–5000 BCE and 600 CE–650 CE

I used line = Regex.Replace(note, "<.*?>", string.Empty); to remove the tags, but according to one of the most popular questions on SO, "RegEx match open tags except XHTML self-contained tags", you really should not use regex to remove HTML.

user557597 · Accepted Answer

How about doing it in a delegate replacement?
edit: As a side note, this is a good regex to remove all tags and script blocks:

<(?:script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?\s*>[\S\s]*?]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

C#:

string line = @"4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";
Regex RxCode = new Regex(@"&#([0-9]+);");
string lineNew = RxCode.Replace(
    line,
    delegate( Match match ) {
        return "" + (char)Convert.ToInt32( match.Groups[1].Value);
    }
);
Console.WriteLine( lineNew );

Output:

4000 BCE-5000 BCE and 600 CE-650 CE
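
If a hand-written pattern is not a requirement, the framework's own decoder may already be enough. The sketch below assumes System.Net.WebUtility is available (.NET Framework 4.0 or later, .NET Core); the class and method names are made up for illustration. HtmlDecode handles decimal and hex numeric references as well as named entities such as &amp;.

```csharp
using System;
using System.Net;

public static class EntityDecoding
{
    // Thin wrapper (hypothetical name) around the built-in decoder.
    // Decodes numeric references like &#8211; or &#x2013; and named
    // entities like &amp; in a single pass.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

Usage would be along the lines of string lineNew = EntityDecoding.Decode(line);. It decodes every entity it recognises, so it is the wrong tool if only numeric codes should be touched.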

edit: If you expect the hex form as well, you can handle that too.

 #  @"&\#(?:([0-9]+)|x([0-9a-fA-F]+));"

 &\#
 (?:
      ( [0-9]+ )                    # (1)
   |  x
      ( [0-9a-fA-F]+ )              # (2)
 )
 ;
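
The expanded layout above can be used directly if the pattern is compiled with RegexOptions.IgnorePatternWhitespace, which is also why the # is escaped there. A minimal sketch, with the class and method names chosen only for illustration:

```csharp
using System.Text.RegularExpressions;

public static class ExpandedPattern
{
    // Free-form version of the pattern. With IgnorePatternWhitespace,
    // unescaped whitespace is skipped and an unescaped # starts a
    // comment that runs to the end of the line, so the literal #
    // after & has to be written as \#.
    private static readonly Regex RxCode = new Regex(@"
        &\#
        (?:
             ( [0-9]+ )            # (1) decimal digits
          |  x
             ( [0-9a-fA-F]+ )      # (2) hex digits
        )
        ;",
        RegexOptions.IgnorePatternWhitespace);

    // Returns true if the text contains a decimal or hex character reference.
    public static bool HasReference(string text)
    {
        return RxCode.IsMatch(text);
    }
}
```

The compact one-line form matches exactly the same input; the free-form version only makes the two capture groups easier to see.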

C#:

Regex RxCode = new Regex(@"&#(?:([0-9]+)|x([0-9a-fA-F]+));");
string lineNew = RxCode.Replace(
    line,
    delegate( Match match ) {
        return match.Groups[1].Success ? 
            "" + (char)Convert.ToInt32( match.Groups[1].Value ) :
            "" + (char)Int32.Parse( match.Groups[2].Value, System.Globalization.NumberStyles.HexNumber);
    }
);
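
One caveat with the (char) cast: it only covers code points up to U+FFFF, and Convert.ToInt32 throws if the digits overflow. A more defensive variant could look like the sketch below. It is an illustration, not part of the original answer: the class name is hypothetical, and leaving an invalid reference unchanged is an assumption about the desired behaviour.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CharRefDecoder
{
    private static readonly Regex RxCode =
        new Regex(@"&#(?:([0-9]+)|x([0-9a-fA-F]+));");

    public static string Decode(string line)
    {
        return RxCode.Replace(line, match =>
        {
            bool isDecimal = match.Groups[1].Success;
            string digits = isDecimal ? match.Groups[1].Value : match.Groups[2].Value;
            NumberStyles style = isDecimal ? NumberStyles.None : NumberStyles.AllowHexSpecifier;

            // Assumption: a reference that overflows an int or is not a
            // valid Unicode scalar value is left in the text as-is.
            int codePoint;
            if (!int.TryParse(digits, style, CultureInfo.InvariantCulture, out codePoint)
                || codePoint < 0
                || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            {
                return match.Value;
            }

            // ConvertFromUtf32 returns a surrogate pair for code points
            // above U+FFFF, which a single (char) cast cannot represent.
            return char.ConvertFromUtf32(codePoint);
        });
    }
}
```

For the sample line, CharRefDecoder.Decode(line) gives the same result as the delegate version; the difference only shows up with input such as &#x1F600; or an overlong number.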
