Chelicerae

Reputation: 33

Decode Cyrillic HTML entities in C#

I have a string, taken from a website using HtmlAgilityPack, that contains HTML entities for Cyrillic letters.

Example:

"&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;"

Is there any way to decode it into the actual characters in C# when saving to a file? I tried HttpUtility.HtmlDecode (System.Web) and WebUtility.HtmlDecode (System.Net), but neither helped.

My attempt:

using System;
using System.Web;

namespace esp
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            string body = "&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;";

            // output is still "&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;"
            Console.WriteLine(HttpUtility.HtmlDecode(body));
        }
    }
}

Upvotes: 3

Views: 380

Answers (1)

Dmitrii Bychenko

Reputation: 186803

Just a guess. As far as I can see, we have the following format:

  &
   Letter(s) - transliterated letter 
   cy        - stands for Cyrillic 
  ; 

We can match all the letters with the help of regular expressions and concatenate them into a string, e.g.

  using System.Text.RegularExpressions;

  ...

  string body = "&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;";

  var transliteratedText = Regex.Replace(
    body,
    @"&(?<letter>[A-Za-z]+)cy;",
    m => m.Groups["letter"].Value);

  Console.Write(transliteratedText);

And we'll have

Korpus

which sounds reasonable, since it's the transliteration of the Russian word Корпус (corpus, body, bulk, carcass). There are several transliteration standards (I tried the Library of Congress scheme, which is one of the most popular); to detect the right standard (or create our own), we need more data.
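One way to gather that data is to list every distinct entity name of this shape that actually occurs in the scraped text. The following is only a sketch: the `EntitySurvey` class and its `DistinctCyrillicEntities` method are hypothetical names, and it assumes all relevant entities end in `cy;`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EntitySurvey
{
    // Returns the distinct entity names ending in "cy" (without '&' and ';'),
    // sorted ordinally so the result is deterministic.
    public static string[] DistinctCyrillicEntities(string html)
    {
        return Regex.Matches(html, @"&([A-Za-z]+cy);")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .Distinct()
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToArray();
    }
}
```

Running it over a few pages should show which names the site really emits before any mapping table is committed to.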

Edit: For instance, if we have a scheme, say,

private static Dictionary<string, string> translit = 
  new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) {
  {"a", "а"},
  {"b", "б"},
  {"v", "в"},
  {"g", "г"},
  {"d", "д"},
  {"ie", "е"},
  //{"", "ё"}, //TODO: define the letter transliteration
  {"zh", "ж"},
  {"z", "з"},
  {"i", "и"},
  {"j", "й"},
  {"k", "к"},
  {"l", "л"},
  {"m", "м"},
  {"n", "н"},
  {"o", "о"},
  {"p", "п"},
  {"r", "р"},
  {"s", "с"},
  {"t", "т"},
  {"u", "у"},
  {"f", "ф"},
  {"h", "х"},
  {"ts", "ц"},
  {"ch", "ч"},
  {"sh", "ш"},
  {"shch", "щ"},
  //{"", "ъ"}, //TODO: define the letter transliteration
  {"y", "ы"},
  //{"", "ь"}, //TODO: define the letter transliteration
  //{"", "э"}, //TODO: define the letter transliteration
  //{"", "ю"}, //TODO: define the letter transliteration
  {"ya", "я"},
};

we can transliterate each letter:

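// Requires: using System.Globalization; using System.Text.RegularExpressions;
private static string MyDecoding(string value) {
  return Regex.Replace(value, @"&(?<letter>[A-Za-z]+)cy;", m => {
    string v = m.Groups["letter"].Value;

    // The dictionary lookup ignores case, so restore the case from the entity name
    return char.IsUpper(v[0])
      ? CultureInfo.InvariantCulture.TextInfo.ToTitleCase(translit[v])
      : translit[v];
  });
}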
...

Console.Write(MyDecoding("&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;"));

Outcome:

Корпус
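The scheme above is a guess with several gaps. As a possible alternative, here is a minimal sketch that assumes the entities follow the HTML named character reference list, where, as far as I know, х is `&khcy;`, ё is `&iocy;`, ъ is `&hardcy;`, ь is `&softcy;`, э is `&ecy;` and ю is `&yucy;`; it is worth checking the names against that list. The `CyrillicEntities` class and `Decode` method are hypothetical names, and the sketch covers only the 33 Russian letters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CyrillicEntities
{
    // Entity stems (the part between '&' and "cy;") mapped to lowercase letters.
    // Assumption: the stems match the HTML named character references.
    private static readonly Dictionary<string, char> Lower =
        new Dictionary<string, char>(StringComparer.Ordinal)
    {
        {"a", 'а'}, {"b", 'б'}, {"v", 'в'}, {"g", 'г'}, {"d", 'д'},
        {"ie", 'е'}, {"io", 'ё'}, {"zh", 'ж'}, {"z", 'з'}, {"i", 'и'},
        {"j", 'й'}, {"k", 'к'}, {"l", 'л'}, {"m", 'м'}, {"n", 'н'},
        {"o", 'о'}, {"p", 'п'}, {"r", 'р'}, {"s", 'с'}, {"t", 'т'},
        {"u", 'у'}, {"f", 'ф'}, {"kh", 'х'}, {"ts", 'ц'}, {"ch", 'ч'},
        {"sh", 'ш'}, {"shch", 'щ'}, {"hard", 'ъ'}, {"y", 'ы'},
        {"soft", 'ь'}, {"e", 'э'}, {"yu", 'ю'}, {"ya", 'я'},
    };

    public static string Decode(string value)
    {
        return Regex.Replace(value, @"&([A-Za-z]+)cy;", m =>
        {
            string stem = m.Groups[1].Value;
            char letter;

            // Unknown names (other entities, other Cyrillic alphabets) are left untouched.
            if (!Lower.TryGetValue(stem.ToLowerInvariant(), out letter))
                return m.Value;

            // An upper-case first letter in the stem marks a capital letter.
            return char.IsUpper(stem[0])
                ? char.ToUpperInvariant(letter).ToString()
                : letter.ToString();
        });
    }
}
```

Unlike indexing the dictionary directly, `TryGetValue` keeps an unrecognized entity in the output instead of throwing, which makes missing mappings easy to spot in the saved file.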

Upvotes: 2
