Abhishek Mathur

Reputation: 316

How to convert HTML to JavaScript escapes in C#

I have converted Hindi text to HTML character codes. Now I want to convert this HTML to Unicode escapes...

Hindi:

श्रीगंगानगर। हनुमानगढ़ मार्ग पर लालगढ़ जाटान छावनी के नजदीक शनिवार सुबह सड़क से पन्द्रह-...

Corresponding HTML:

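&#2358;&#2381;&#2352;&#2368;&#2327;&#2306;&#2327;&#2366;&#2344;&#2327;&#2352;&#2404; &#2361;&#2344;&#2369;&#2350;&#2366;&#2344;&#2327;&#2338;&#2364; &#2350;&#2366;&#2352;&#2381;&#2327; &#2346;&#2352;...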

Now I want to convert this HTML code to Unicode escapes like:

\u0936\u094D\u0930\u0940\u0917\u0902\u0917\u093E\u0928\u0917\u0930\u0964 \u0939\u0928\u0941\u092E\u093E\u0928\u0917\u0922\u093C \u092E\u093E\u0930\u094D\u0917 \u092A\u0930

Just like on this site. But I want to do this conversion in C# code, not in JavaScript...

Upvotes: 1

Views: 470

Answers (4)

Joachim Isaksson

Reputation: 180877

I see you got multiple answers that work directly from the raw text. Here's a way to do it from your HTML escapes, as you asked:

string input = "&#2358;&#2381;&#2352;&#2368;&#2327;&#2306;&#2327;...";

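// [0-9]+ rather than [0-9]* so that an empty "&#;" is left alone instead of making int.Parse throw
var output = Regex.Replace(input, @"&#([0-9]+);", 
               x => String.Format("\\u{0:X4}", int.Parse(x.Groups[1].Value)));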

Or alternatively:

// Needs using System.Linq; and using System.Net; (for WebUtility)
var output = String.Join("", WebUtility.HtmlDecode(input)
                   .Select(x => "\\u" + ((int)x).ToString("X4")));

Upvotes: 0

Jon Hanna

Reputation: 113232

StringBuilder sb = new StringBuilder(hindiString.Length * 6);
foreach(char c in hindiString)
  sb.Append(@"\u").Append(((int)c).ToString("X4"));
return sb.ToString();

I originally assumed you would have to merge UTF-16 high and low surrogates first for anything outside of the BMP. Edit: scratch that. JavaScript uses UTF-16 internally, the same as C#, so the above works fine outside the BMP too: each surrogate simply gets its own \uXXXX escape.
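To illustrate the point about characters outside the BMP, here is a minimal sketch that wraps the loop above in a method and runs it on U+1F600. The names `EscapeDemo` and `ToJsEscapes` are made up for this example and are not part of any library.

```csharp
using System;
using System.Text;

static class EscapeDemo
{
    // Escapes every UTF-16 code unit of the input as \uXXXX.
    public static string ToJsEscapes(string text)
    {
        var sb = new StringBuilder(text.Length * 6);
        foreach (char c in text)
            sb.Append(@"\u").Append(((int)c).ToString("X4"));
        return sb.ToString();
    }

    static void Main()
    {
        // U+1F600 lies outside the BMP, so C# stores it as two chars (a surrogate pair).
        string emoji = char.ConvertFromUtf32(0x1F600);
        Console.WriteLine(emoji.Length);        // 2
        Console.WriteLine(ToJsEscapes(emoji));  // \uD83D\uDE00
    }
}
```

A JavaScript string literal containing `\uD83D\uDE00` holds the same two code units, so the character should survive the round trip without any merging step.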

However, the corresponding HTML to श्रीगंगानगर। हनुमानगढ़ मार्ग पर लालगढ़ जाटान छावनी के नजदीक शनिवार सुबह सड़क से पन्द्रह is:

<p>श्रीगंगानगर। हनुमानगढ़ मार्ग पर लालगढ़ जाटान छावनी के नजदीक शनिवार सुबह सड़क से पन्द्रह</p>

And the corresponding JavaScript is:

"श्रीगंगानगर। हनुमानगढ़ मार्ग पर लालगढ़ जाटान छावनी के नजदीक शनिवार सुबह सड़क से पन्द्रह"

Or:

'श्रीगंगानगर। हनुमानगढ़ मार्ग पर लालगढ़ जाटान छावनी के नजदीक शनिवार सुबह सड़क से पन्द्रह'

Why not just use them?

Upvotes: 1

Kishore Kumar

Reputation: 12864

StringBuilder sb = new StringBuilder();
foreach(char c in hindi)
{
    sb.Append(@"\u").Append(((int)c).ToString("X4"));
}
return sb.ToString();

Upvotes: 0

Sufian Latif

Reputation: 13356

You can

  • capture each Unicode character reference using the regular expression &#([0-9]+);
  • convert the captured digits into an integer
  • get the hexadecimal representation of that integer as a string
  • left-pad the string with zeros to 4 characters and put \u in front of it
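The steps above could be sketched in C# roughly as follows. `EntityConverter` and `EntitiesToJsEscapes` are hypothetical names chosen for this example, and the sketch assumes every reference is decimal and at most 0xFFFF, which holds for Devanagari; larger code points would need to be split into surrogate pairs first.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class EntityConverter
{
    // Hypothetical helper: rewrites decimal references such as &#2358; as \u0936.
    // Assumes each code point fits in four hex digits (BMP only).
    public static string EntitiesToJsEscapes(string html)
    {
        // Step 1: capture the digits of each &#NNNN; reference.
        return Regex.Replace(html, "&#([0-9]+);", m =>
        {
            // Step 2: captured digits -> integer code point.
            int codePoint = int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
            // Steps 3 and 4: hexadecimal, left-padded to 4 digits, prefixed with \u.
            return @"\u" + codePoint.ToString("X4");
        });
    }

    static void Main()
    {
        Console.WriteLine(EntitiesToJsEscapes("&#2358;&#2381;&#2352;&#2368;"));
        // prints \u0936\u094D\u0930\u0940
    }
}
```

Text that is not a numeric reference (spaces, punctuation, plain ASCII) passes through unchanged, which matches the spaces left between words in the question's expected output.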

Upvotes: 0
