anonn023432

Reputation: 3120

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in Ruby and split the text line by line.

Here is my code:

def file_read(filename)
  File.open(filename, 'r').read
end

puts f = file_read('alice_in_wonderland.txt')

This works perfectly. But when I add the method line_cutter like this:

def file_read(filename)
  File.open(filename, 'r').read
end

def line_cutter(file)
  file.scan(/\w/)
end

puts f = line_cutter(file_read('alice_in_wonderland.txt'))

I get an error:

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I found this online, written for untrusted website input, and tried to adapt it to my own code, but it's not working. How can I get rid of this error?

Link to the file: File

Upvotes: 6

Views: 2771

Answers (2)

cremno

Reputation: 4927

The linked text file contains the following line:

Character set encoding: ISO-8859-1

If converting the file itself isn't desired or possible, you have to tell Ruby that it is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case), and any byte that isn't valid UTF-8 triggers the error as soon as a regex touches the string. One way to do that is:

s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding  # => #<Encoding:ISO-8859-1>

Or, if you prefer your string UTF-8 encoded (see utf8everywhere.org), give both the external and the internal encoding so that Ruby transcodes while reading:

s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding  # => #<Encoding:UTF-8>
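
To tie this back to the question's file_read and line_cutter methods, here is a minimal sketch of how the second variant might be wired in. The source_encoding parameter and the temporary sample file are assumptions made for illustration (the sample stands in for alice_in_wonderland.txt so the snippet runs on its own); they are not part of the original code.

```ruby
require 'tempfile'

# Read a file stored in source_encoding and transcode it to UTF-8.
# The source_encoding parameter is a hypothetical addition for this sketch.
def file_read(filename, source_encoding = 'ISO-8859-1')
  File.read(filename, encoding: "#{source_encoding}:UTF-8")
end

# Unchanged from the question: collect every word character.
def line_cutter(text)
  text.scan(/\w/)
end

# Stand-in for alice_in_wonderland.txt: the byte 0xE9 is "é" in ISO-8859-1
# but is not a valid byte sequence on its own in UTF-8.
sample = Tempfile.new('latin1_sample')
sample.binmode
sample.write("caf\xE9 au lait\n".b)
sample.close

# Read with the wrong encoding tag: the bytes are kept, but the string is broken.
raw = File.read(sample.path, encoding: 'UTF-8')
puts raw.valid_encoding?   # prints false

# Read with the right encoding pair: the string is valid UTF-8 and can be scanned.
text = file_read(sample.path)
puts text.valid_encoding?  # prints true
puts line_cutter(text).length
```

Checking valid_encoding? right after reading is a cheap way to find out whether the encoding guess was right before any regex runs over the string.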

Upvotes: 7

JLB

Reputation: 323

It seems to work if you read the file directly from the page, so maybe there's something funny about the local copy you have. Try this:

require 'net/http'

uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)
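
A likely reason the download behaves differently is that Net::HTTP returns the body as a binary (ASCII-8BIT) string by default, and a binary string has no invalid byte sequences for the regex engine to reject. The sketch below reproduces that without a network call, using a hand-made sample string as a stand-in for the response body; the variable names are illustrative only.

```ruby
# Stand-in for a raw HTTP body: the same bytes as "café" in ISO-8859-1,
# tagged as binary (ASCII-8BIT).
bytes = "caf\xE9".b
puts bytes.encoding

# Scanning a binary string does not raise; \w only matches ASCII word bytes,
# so the 0xE9 byte is skipped rather than treated as a letter.
letters = bytes.scan(/\w/)
puts letters.inspect

# To get real text back, relabel the bytes with their true encoding and
# transcode to UTF-8 (dup keeps the original binary string untouched).
text = bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts text
```

So the scan succeeds on the downloaded body, but non-ASCII letters are silently dropped; declaring the real encoding keeps them.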

Upvotes: 2
