Reputation: 2056
I have a situation where a Nokogiri result has hex encoding in it. The problem is that the actual encoding of the result is UTF-8, but it contains hex characters:
Best 100+ Fishing Pictures | Download Free Images on Unsplash
https%3A%2F%2Funsplash.com%2Fs%2Fphotos%2Ffishing&rut=d1dd8233a6ad628121fa36d8d5a51be0b6fb0eda75e234d5036bf7b49efcf25b
current encoding: UTF-8
Fish Images | Free Vectors, Stock Photos & PSD
https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68
current encoding: UTF-8
How to Use Fish vs. fishes Correctly
https%3A%2F%2Fgrammarist.com%2Fusage%2Ffish%2Dfishes%2F&rut=e0897e219c9b0b125a1442b59e36c49753417a1b7812ae9d3ab0bc3179ffe6b5
current encoding: UTF-8
The URLs are technically encoded as UTF-8, but contain hex characters. I haven't found anything that recognizes them as hex and translates them to UTF-8, so I'm lost as to how to recognize those character groupings for translation. Short of writing a complex method that might work, I thought I would see if there's a way to force recognition of the original string so it can then be translated using force_encoding or something of that sort.
Does anybody have any advice on how to accomplish this? Any insight is appreciated. I'd rather avoid having to hand-code these characters into a method.
Update:
CGI::unescapeHTML(<string>) isn't working:
irb(main):024:0> a
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):025:0> CGI::unescapeHTML(a)
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):026:0> CGI::unescapeHTML(a) == a
=> true
Upvotes: 1
Views: 80
Reputation: 2238
You didn't give the source for your "encoding of the result is UTF-8, but contains hex characters" in the original question. I don't think I understand that question.
In your update, you used the incorrect method. unescapeHTML is for resolving HTML entities:
irb(main):010:0> CGI.escapeHTML '<'
=> "&lt;"
irb(main):012:0> CGI.unescapeHTML '&lt;'
=> "<"
The method you need is CGI.unescape, which decodes URL percent-encoded sequences such as %3A and %2F:
irb(main):017:0> encoded_url = "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):018:0> CGI.unescape encoded_url
=> "https://www.freepik.com/free-photos-vectors/fish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
If that doesn't solve your actual problem, I'm happy to revise the answer given more debuggable source code in the question.
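As a rough sketch of how this could be applied to the strings in the question: the sample values all end in a literal `&rut=...` parameter, so you may want to drop that before decoding. The helper name `decode_result_url` below is hypothetical (it is not part of CGI or Nokogiri), and it assumes the first literal `&` marks the start of that trailing parameter.

```ruby
require 'cgi'

# Sample value copied from the question.
encoded = "https%3A%2F%2Funsplash.com%2Fs%2Fphotos%2Ffishing&rut=d1dd8233a6ad628121fa36d8d5a51be0b6fb0eda75e234d5036bf7b49efcf25b"

# Hypothetical helper: keep everything before the first literal "&"
# (assumed here to be the start of the "rut" parameter), then
# percent-decode the remaining URL with CGI.unescape.
def decode_result_url(raw)
  url_part, _separator, _rest = raw.partition('&')
  CGI.unescape(url_part)
end

puts decode_result_url(encoded)
# prints "https://unsplash.com/s/photos/fishing"
```

Note that CGI.unescape also turns `+` into a space, since it follows form-encoding rules. That is harmless for the samples shown, but worth checking if your real URLs can contain a literal `+`.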
Upvotes: 1