Reputation: 33
I have a string, taken from a website using HtmlAgilityPack, that contains HTML entities for Cyrillic letters.
Example:
"&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;"
Is there any way to decode it into characters in C# when saving to a file? I tried HttpUtility.HtmlDecode (System.Web) and WebUtility.HtmlDecode (System.Net), but neither helped.
My attempt:
using System;
using System.Web;

namespace esp
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            string body = "&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;";
            // Output is still "&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;"
            Console.WriteLine(HttpUtility.HtmlDecode(body));
        }
    }
}
Upvotes: 3
Views: 380
Reputation: 186803
Just a guess. As far as I can see, each entity has the following format:
&
Letter(s): the transliterated letter
cy: stands for Cyrillic
;
We can match all the entities with the help of a regular expression and replace each one with its captured letter(s), e.g.
using System.Text.RegularExpressions;
...
string body = "&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;";

// Keep only the letter(s) between '&' and "cy;"
var transliteratedText = Regex.Replace(
    body,
    @"&(?<letter>[A-Za-z]+)cy;",
    m => m.Groups["letter"].Value);

Console.Write(transliteratedText);
And we'll have
Korpus
which sounds reasonable, since it's the transliterated Russian word Корпус (Corpus, Body, Bulk, Carcass). There are several transliteration standards (I've tried the Library of Congress scheme, which is just one of the most popular); to detect the right standard (or create our own), we need more data.
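One way to gather that data is to list the distinct entity names that occur in the scraped text. This is only a minimal sketch, assuming every entity follows the same &...cy; pattern; the names EntityNames and ListNames are hypothetical and not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class EntityNames
{
    // Collects the distinct names found between '&' and "cy;".
    // The comparison is case-sensitive, so "K" and "k" are reported separately.
    public static List<string> ListNames(string text)
    {
        return Regex.Matches(text, @"&(?<letter>[A-Za-z]+)cy;")
            .Cast<Match>()
            .Select(m => m.Groups["letter"].Value)
            .Distinct()
            .ToList();
    }

    static void Main()
    {
        var names = ListNames("&Kcy;&ocy;&rcy;&pcy;&ucy;&scy; &kcy;&ocy;&rcy;&pcy;&ucy;&scy;");
        Console.WriteLine(string.Join(", ", names));
    }
}
```

Running this over a larger sample of the page shows which multi-letter names (such as zh or shch) actually occur, which is what the scheme has to cover.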
Edit: For instance, if we have a scheme, say,
private static Dictionary<string, string> translit =
    new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) {
        {"a", "а"},
        {"b", "б"},
        {"v", "в"},
        {"g", "г"},
        {"d", "д"},
        {"ie", "е"},
        //{"", "ё"}, //TODO: define the letter transliteration
        {"zh", "ж"},
        {"z", "з"},
        {"i", "и"},
        {"j", "й"},
        {"k", "к"},
        {"l", "л"},
        {"m", "м"},
        {"n", "н"},
        {"o", "о"},
        {"p", "п"},
        {"r", "р"},
        {"s", "с"},
        {"t", "т"},
        {"u", "у"},
        {"f", "ф"},
        {"h", "х"},
        {"ts", "ц"},
        {"ch", "ч"},
        {"sh", "ш"},
        {"shch", "щ"},
        //{"", "ъ"}, //TODO: define the letter transliteration
        {"y", "ы"},
        //{"", "ь"}, //TODO: define the letter transliteration
        //{"", "э"}, //TODO: define the letter transliteration
        //{"", "ю"}, //TODO: define the letter transliteration
        {"ya", "я"},
    };
we can transliterate each letter:
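// Requires: using System.Globalization; using System.Text.RegularExpressions;
private static string MyDecoding(string value) {
    return Regex
        .Replace(value, @"&(?<letter>[A-Za-z]+)cy;", m => {
            string v = m.Groups["letter"].Value;

            // Note: translit[v] throws KeyNotFoundException for unmapped names
            return char.IsUpper(v[0])
                ? CultureInfo.InvariantCulture.TextInfo.ToTitleCase(translit[v])
                : translit[v];
        }
    );
}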
...
Console.Write(MyDecoding("&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;"));
Outcome:
Корпус
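The entity names look like the HTML5 named character references for Cyrillic (for example &iecy; for е and &khcy; for х), so the TODO entries could probably be filled from that list. Below is a sketch under that assumption, covering only the 33 Russian letters; CyrillicEntities and Decode are hypothetical names, and unknown entities are left untouched instead of throwing.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CyrillicEntities
{
    // Lower-case entity names (without the "cy" suffix) mapped to lower-case letters.
    // Assumption: names follow the HTML5 named character references.
    private static readonly Dictionary<string, string> Letters =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
    {
        {"a", "а"}, {"b", "б"}, {"v", "в"}, {"g", "г"}, {"d", "д"},
        {"ie", "е"}, {"io", "ё"}, {"zh", "ж"}, {"z", "з"}, {"i", "и"},
        {"j", "й"}, {"k", "к"}, {"l", "л"}, {"m", "м"}, {"n", "н"},
        {"o", "о"}, {"p", "п"}, {"r", "р"}, {"s", "с"}, {"t", "т"},
        {"u", "у"}, {"f", "ф"}, {"kh", "х"}, {"ts", "ц"}, {"ch", "ч"},
        {"sh", "ш"}, {"shch", "щ"}, {"hard", "ъ"}, {"y", "ы"}, {"soft", "ь"},
        {"e", "э"}, {"yu", "ю"}, {"ya", "я"},
    };

    public static string Decode(string value)
    {
        return Regex.Replace(value, @"&(?<letter>[A-Za-z]+)cy;", m =>
        {
            string name = m.Groups["letter"].Value;
            string letter;

            // Leave entities we do not know exactly as they were.
            if (!Letters.TryGetValue(name, out letter))
                return m.Value;

            // An upper-case first character in the name means an upper-case letter.
            return char.IsUpper(name[0]) ? letter.ToUpperInvariant() : letter;
        });
    }

    static void Main()
    {
        Console.OutputEncoding = System.Text.Encoding.UTF8;
        Console.WriteLine(Decode("&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;")); // Корпус
    }
}
```

Using TryGetValue rather than the indexer means that text containing other entities ending in "cy;" (for instance Ukrainian or Serbian letters not in the table) passes through unchanged, so gaps in the table are easy to spot in the output.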
Upvotes: 2