Luiz E.

Reputation: 7249

how to decode UTF-8 to HTML tags

I have an HTML document saved in my database as follows:

\\u003cp style=\\\"text-align: center; opacity: 1;\\\"\\u003e\\u003cstrong\\u003e\\u003cspan style=\\\"font-size: 18pt;\\\

I know it is ugly, and I know it is not the desired approach, but this is a legacy system.

My task is to take all of these HTML documents and convert each one to a document in Google Docs. Google Docs can parse HTML into its internal format pretty well, but the HTML needs to be valid, with <p> instead of \\u003cp.

I'm trying to convert/decode/parse/whatever this string into valid HTML, but so far without any luck.

Things I already tried

I tried the htmlentities gem, CGI decode, Nokogiri::HTML.parse, and JSON.parse, and none of them did the job.

I also tried string.encode(xxxx), also without luck. I was really hoping that the .encode method would do it, but I couldn't make it work. Maybe I'm using the wrong encoding? (I tried all of the ISO-xxx encodings.)

Upvotes: 0

Views: 711

Answers (2)

Tom Lord

Reputation: 28305

Here's a quick workaround for you:

input_string.gsub(/\\u(\h{4})/) { [$1.to_i(16)].pack('U') }

With the example input you gave above, this results in:

"<p style=\\\"text-align: center; opacity: 1;\\\"><strong><span style=\\\"font-size: 18pt;\\"

Explanation:

\u003c == <. The left hand side is an escaped unicode character; this is not the same thing as \\u003c, which is a literal backslash followed by u003c.

The regular expression \\u(\h{4}) will match any occurrences of this (\h stands for "hexadecimal" and is equivalent to [0-9a-fA-F]). The captured digits are converted to an integer with to_i(16), and Array#pack with the 'U' directive turns that code point into a UTF-8 character.
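To make the distinction above concrete, here is a minimal sketch. The method name decode_unicode_escapes and the sample string are made up for illustration; only the gsub/pack call comes from the answer itself.

```ruby
# Hypothetical helper wrapping the gsub from the answer above.
def decode_unicode_escapes(text)
  # Match a literal backslash, the letter "u", then exactly four hex digits,
  # and replace the match with the character for that code point.
  text.gsub(/\\u(\h{4})/) { [$1.to_i(16)].pack('U') }
end

# Double quotes: Ruby itself interprets \u003c, so this is the single character "<".
interpreted = "\u003c"

# Single quotes: the backslash stays literal, so this is six characters,
# which is what a value read back from the database looks like.
literal = '\u003c'

# Illustrative input only, not the asker's real data.
escaped = '\u003cp\u003eHello\u003c/p\u003e'
decoded = decode_unicode_escapes(escaped)

puts interpreted.length # 1
puts literal.length     # 6
puts decoded            # <p>Hello</p>
```

Using \h{4} rather than a looser pattern means only well-formed escapes are touched; anything else in the string passes through unchanged.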


Ideally of course, you'd solve the problem at its root rather than retro-fit a workaround like this. But if that's outside of your control, then a workaround will have to suffice.

Upvotes: 1

mechnicov

Reputation: 15258

Using Array#pack:

string = "\\u003cp style=\\\"text-align: center; opacity: 1;\\\"\\u003e\\u003cstrong\\u003e\\u003cspan style=\\\"font-size: 18pt;\\"

string.gsub(/\\u(....)/) { [$1.hex].pack("U") }
# => "<p style=\\\"text-align: center; opacity: 1;\\\"><strong><span style=\\\"font-size: 18pt;\\"
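Note that the result above still shows backslashes in front of the double quotes. If those backslashes are really present in the stored value (an assumption; they may only be an artifact of how the string was printed), a second gsub can strip them. The sample value below is invented for illustration, and the sketch uses \h{4} instead of .... so that only valid hex escapes are decoded.

```ruby
# Illustrative value: single quotes keep every backslash literal,
# so this contains \u003c, \u003e and backslash-escaped double quotes.
raw = '\u003cp style=\"text-align: center;\"\u003e'

html = raw
  .gsub(/\\u(\h{4})/) { [$1.hex].pack('U') } # decode the \uXXXX escapes
  .gsub('\\"', '"')                          # assumption: turn \" into a plain "

puts html # <p style="text-align: center;">
```

If the stored data turns out not to contain escaped quotes, the second gsub simply matches nothing and leaves the string as it is.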

Upvotes: 1
