Victor Ch.

Reputation: 66

Correctly use UTF-8 chars as hash keys

The following code reads the contents of a UTF-8 file, counts the letters in each line (using a hash), and writes the result to another file.

The problem is that each letter in the hash is represented by its UTF-8 code, not by the character itself.

How do I correctly use the characters themselves, not their UTF-8 codes, as hash keys?

#encoding=utf-8
#Encoding.filesystem = "UTF-8"
length = 0

letters = Hash.new(0)

File.open("hist_1.txt", "r:UTF-8") do |f|
  f.each_line do |line|
    line.chomp
    line.chars.each do |l|
      l = l.encode("Windows-1251", "UTF-8")
      letters[l] += 1
    end
  end
end

a = letters.sort
#
#p a
puts a
File.open("results.txt", "w:UTF-8") { |file| file.write a }

Upvotes: 0

Views: 672

Answers (1)

Aleksei Matiushkin

Reputation: 121000

I believe that since your desired output is about letters, you should use neither chars nor codepoints:

content = File.read("hist_1.txt", encoding: "UTF-8") # read the whole file
#       ⇓⇓⇓⇓⇓⇓⇓⇓⇓ split to letters
content.split(//).inject(Hash.new(0)) do |memo, cp|
  memo[cp] += 1
  memo
end
#⇒ {"В"=>1, "о"=>5, "т"=>5, " "=>9, "п"=>4, "э"=>1, ...}

Now you probably want to call downcase on the hash keys and get rid of spaces and punctuation.
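That last step could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper name letter_counts and the sample string are made up here (the sample stands in for the contents of hist_1.txt), Ruby 2.4 or newer is assumed so that downcase handles Cyrillic, and "letter" is taken to mean anything matching the Unicode property \p{L}. The strings are never re-encoded, so the hash keys stay UTF-8.

```ruby
# Hypothetical helper (not from the question): counts letters
# case-insensitively and ignores spaces, digits and punctuation.
def letter_counts(text)
  # downcase is Unicode-aware on Ruby 2.4+; \p{L} matches any Unicode letter.
  text.downcase.scan(/\p{L}/).each_with_object(Hash.new(0)) do |letter, counts|
    counts[letter] += 1
  end
end

# Made-up sample; in the real script this would be
# File.read("hist_1.txt", encoding: "UTF-8").
sample = "Вот это Да, вот ЭТО да!"
counts = letter_counts(sample)

# One "letter count" pair per line, sorted by letter.
report = counts.sort.map { |letter, n| "#{letter} #{n}" }.join("\n")
puts report
```

Building the report as a plain string, rather than writing the sorted array itself, avoids the inspect-style output with quotes and brackets; the same string can then be passed to File.write("results.txt", report, encoding: "UTF-8").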

Upvotes: 1
