Peter

Reputation: 132307

Change unicode characters in UTF-8 from symbol to numeric code

I'm using Ruby (and Nokogiri, in case that helps) to encode some documents. I want to convert literal Unicode characters (like “) to numeric HTML entities (like &#8220;). How do I do this? I know I can convert a single character with something like

s = '“'
puts "&##{s.unpack('U').first};"   # gives &#8220;

but is there a way to do this properly using iconv or nokogiri?

Upvotes: 0

Views: 576

Answers (3)

pguardiario

Reputation: 54992

It may not be proper, but Nokogiri (actually libxml2, I think) does this when it doesn't understand the encoding:

Nokogiri::HTML(html,nil,'klingon')

Upvotes: 1

steenslag

Reputation: 80075

There is the HTMLEntities gem. For its decimal encoding it does about the same as your code (unpack).

Upvotes: 1

Peter

Reputation: 132307

I've come up with this method. It takes a brute-force approach that I'd hope a compiled library solution could replace, but it works:

def clean(text)
  # Replace every character outside printable ASCII (U+0020..U+007E)
  # with its decimal numeric entity. Tabs and newlines are converted too.
  text.gsub(/[^\u{20}-\u{7E}]/) { |char| "&##{char.unpack('U')[0]};" }
end
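
As a possible alternative to the regex, the same conversion can be sketched with Ruby's built-in transcoding: `String#encode` accepts a `fallback:` option that is called for each character the target encoding cannot represent. The helper name `to_numeric_entities` is hypothetical, and the sketch assumes the input is a valid UTF-8 string.

```ruby
# Hypothetical helper name; assumes `text` is valid UTF-8.
def to_numeric_entities(text)
  # Transcode to US-ASCII. Each character with no ASCII representation is
  # handed to the fallback, which returns its decimal character reference.
  text.encode('US-ASCII', fallback: ->(char) { "&##{char.ord};" })
end

puts to_numeric_entities("\u201Cquoted\u201D")   # prints &#8220;quoted&#8221;
```

Unlike the `gsub` version, this leaves tabs and newlines alone, since they are already ASCII. It also does not escape `&`, `<`, or `>`. If hexadecimal references are acceptable, `encode('US-ASCII', xml: :text)` should handle both in one call.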

Upvotes: 0
