Reputation: 66
The following code reads content from a UTF-8 file, counts the letters in each line (using a hash), and writes the result to another file.
The problem is that each letter in the hash is represented by its UTF-8 code, not by the symbol itself.
How do I correctly use the symbols, rather than their UTF-8 codes, in the hash?
#encoding=utf-8
#Encoding.filesystem = "UTF-8"
length = 0
letters = Hash.new(0)
File.open("hist_1.txt", "r:UTF-8") do |f|
  f.each_line do |line|
    line.chomp
    line.chars.each do |l|
      l = l.encode("Windows-1251", "UTF-8")
      letters[l] += 1
    end
  end
end
a = letters.sort
#
#p a
puts a
File.open("results.txt", "w:UTF-8") { |file| file.write a }
Upvotes: 0
Views: 672
Reputation: 121000
I believe that, since your desired output is about letters, you should use neither chars nor codepoints:
content = File.read("hist_1.txt", encoding: "UTF-8") # read the whole file
#       ⇓⇓⇓⇓⇓⇓⇓⇓⇓ split into letters
content.split(//).inject(Hash.new(0)) do |memo, cp|
  memo[cp] += 1
  memo
end
#⇒ {"В"=>1, "о"=>5, "т"=>5, " "=>9, "п"=>4, "э"=>1, ...}
Now you probably want to call downcase on the hash keys and get rid of spaces and punctuation.
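A rough sketch of that last step, under a few assumptions that are mine rather than part of the code above: Ruby 2.4 or later (where downcase also handles non-ASCII letters such as Cyrillic), and "letter" meaning anything matched by the Unicode property \p{L}. The name count_letters is made up for this illustration.

```ruby
# Hypothetical helper, not part of the answer above: counts letters only,
# ignoring case, whitespace, digits and punctuation.
def count_letters(text)
  text.downcase        # assumes Ruby 2.4+, where downcase maps Cyrillic too
      .scan(/\p{L}/)   # \p{L} matches any single Unicode letter
      .each_with_object(Hash.new(0)) { |letter, counts| counts[letter] += 1 }
end

counts = count_letters("Вот, вот!")
# counts == { "в" => 2, "о" => 2, "т" => 2 }
```

The keys stay ordinary UTF-8 strings because nothing is re-encoded along the way, so they can be printed or written to a file opened with "w:UTF-8" as they are.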
Upvotes: 1