Reputation: 3316
I'm making a simple Sinatra-based web app to display Chinese text. I know enough about encoding to know that I can lose information if I don't handle it properly, but I feel a bit lost when it comes to the details. It's also the first time I'm working with non-English text in Ruby.
Are there any areas in particular that I have to be careful about within my programming stack? Also, are there any extra libraries I should know about to make sure I encode and decode properly?
My programming stack currently consists of:
Upvotes: 0
Views: 1180
Reputation: 42863
I've been screen scraping Chinese characters for a few months at http://sinograms.com. I'm using Rails 3, Ruby 1.9.2, and Heroku.
I found no encoding issues; however, I'm only accepting Unicode characters. UTF-8 is an encoding of Unicode that is backwards compatible with ASCII, so if you stick with that you should be fine.
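To make the ASCII-compatibility point concrete, here is a minimal sketch (the variable names are just for illustration). It assumes the source file is saved as UTF-8, and shows that ASCII text keeps one byte per character while each of these Chinese characters takes three bytes:

```ruby
# encoding: UTF-8
# (the magic comment matters on Ruby 1.9; Ruby 2.0+ already defaults to UTF-8)

ascii = "abc"
hanzi = "中文"

# ASCII characters are encoded as the same single bytes in UTF-8.
ascii_bytes = ascii.bytes.to_a
same_as_ascii = ascii.encode("US-ASCII").bytes.to_a == ascii_bytes

# Each of these Chinese characters takes three bytes in UTF-8,
# so character count and byte count differ.
hanzi_bytes = hanzi.bytes.to_a
char_count  = hanzi.length
byte_count  = hanzi.bytesize

puts "#{ascii}: #{ascii_bytes.inspect} (ASCII-identical: #{same_as_ascii})"
puts "#{hanzi}: #{char_count} characters, #{byte_count} bytes"
```

The practical consequence is that anything that counts or slices by bytes (column widths, truncation, content lengths) has to be looked at again once the text is no longer pure ASCII.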
This is the best resource I found for ruby and encoding:
http://blog.grayproductions.net/articles/ruby_19s_string
You can check whether a character is a Chinese character (that is, whether its code point falls in one of the CJK Unified Ideographs blocks) with the following script:
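def check(char)
  # Code point of the first character (the string must be UTF-8).
  code = char.unpack('U*').first
  return false if code.nil? # empty string

  # CJK Unified Ideographs
  return true if code >= 0x4E00 && code <= 0x9FFF
  # CJK Unified Ideographs Extension A
  return true if code >= 0x3400 && code <= 0x4DBF
  # CJK Unified Ideographs Extension B
  return true if code >= 0x20000 && code <= 0x2A6DF
  # CJK Unified Ideographs Extension C
  return true if code >= 0x2A700 && code <= 0x2B73F
  false
end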
Upvotes: 0
Reputation: 78523
The best post I've read on Ruby's character set implementation was written by one of the people behind most of the code involved:
http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html
I ran into it while looking into ICU support in Ruby:
http://redmine.ruby-lang.org/issues/2034
Upvotes: 1
Reputation: 4459
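Ruby works pretty well with UTF-8, so you shouldn't have any problems with it.
In some cases, though, you need the magic comment # encoding: UTF-8 on the first line of your source files. On Ruby 1.9 this applies to any file that contains non-ASCII string literals, because the default source encoding there is US-ASCII.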
You can read this http://blog.grayproductions.net/articles/understanding_m17n to understand encoding in Ruby.
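Here is a rough sketch of how the magic comment and the String encoding methods fit together. It only uses core String methods (`encoding`, `force_encoding`, `valid_encoding?`, `encode`); the variable names and the choice of GBK as the "legacy" encoding are assumptions made for the example:

```ruby
# encoding: UTF-8

# With the magic comment (or on Ruby 2.0+), literals are tagged as UTF-8.
title = "中文"
literal_encoding = title.encoding

# Bytes read from a socket or a binary file often arrive tagged as binary.
# force_encoding only relabels the bytes; it does not convert them.
raw = "\xE4\xB8\xAD\xE6\x96\x87".dup.force_encoding(Encoding::BINARY)
relabelled = raw.dup.force_encoding(Encoding::UTF_8)
looks_valid = relabelled.valid_encoding?

# encode actually transcodes, e.g. to and from a legacy Chinese encoding.
gbk = title.encode("GBK")
round_trip = gbk.encode("UTF-8")

puts literal_encoding
puts looks_valid && relabelled == title
puts "GBK bytes: #{gbk.bytesize}, round trip ok: #{round_trip == title}"
```

The distinction to keep in mind is that `force_encoding` fixes a wrong label on bytes that are already correct, while `encode` changes the bytes themselves; mixing the two up is a common way to end up with garbled text.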
Upvotes: 1