Dave

Reputation: 19320

In Ruby, how do I deal with non-UTF-8 characters in PDF content?

I’m using Rails 4.2.7. I’m downloading and writing PDF content from the web, like so …

    res1 = Net::HTTP.SOCKSProxy('127.0.0.1', 50001).start(uri.host, uri.port) do |http|
      puts "launching #{uri}"
      resp = http.get(uri)
      status = resp.code
      content = resp.body
      content_type = resp['content-type']
      content_encoding = resp['content-encoding']
    end
…
    if content_type == 'application/pdf' || content_type.include?('application/x-javascript')
      File.open(file_location, "w") { |file| file.write content }

I’m noticing that for some content, I get the error below:

Error during processing: "\xC2" from ASCII-8BIT to UTF-8
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `write'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `block in pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `open'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:76:in `process_race_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_race_finder_service.rb:75:in `process_race_link'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:29:in `block in process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `each'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `process_data'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:18:in `block in run_all_crawlers'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'

I tried to account for it by replacing invalid characters, like so …

    File.open(file_location, "w") { |file| file.write content }
    content.encode('UTF-8', :invalid => :replace, :undef => :replace)

but then I get the error

error: PDF malformed, expected 'endstream' but found 0 instead

when trying to read the PDF file. Does anyone know of a better way to deal with downloaded PDF docs that won’t corrupt them?

Upvotes: 1

Views: 1288

Answers (1)

Aleksei Matiushkin

Reputation: 121010

I think the easiest solution would be to write the content as is using IO.binwrite, which opens the file in binary mode and writes the bytes without any transcoding:

    File.binwrite(file_location, content)
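To check that nothing is altered on the way to disk, here is a minimal round-trip sketch. The sample bytes and the temporary path are made up for illustration; in the question they would be resp.body and file_location.

```ruby
require "tmpdir"

# Hypothetical stand-in for resp.body: a binary (ASCII-8BIT) string that
# contains bytes which are not valid UTF-8 on their own.
content = "%PDF-1.4\nstream\n\xC2\x00\xFF\nendstream\n".b

# Hypothetical path; the question uses file_location.
file_location = File.join(Dir.tmpdir, "binwrite_sketch_#{Process.pid}.pdf")

# binwrite opens the file in binary mode, so no transcoding is attempted.
bytes_written = File.binwrite(file_location, content)

# Reading back in binary mode should return exactly the same bytes.
round_trip = File.binread(file_location)
puts round_trip == content

File.delete(file_location)
```

Opening the file yourself with mode "wb" instead of "w" should behave the same way, if you prefer to keep the File.open block.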

The above might fail if the files you receive are in different encodings. In that case I would try:

    content.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
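To see what that conversion does to the bytes, here is a small sketch with made-up sample data. It only illustrates the transcoding itself; whether a PDF reader accepts the converted output is not something it establishes, since every byte at or above 0x80 is rewritten as a two-byte UTF-8 sequence.

```ruby
# Hypothetical sample bytes standing in for downloaded content.
content = "caf\xE9 \xC2".b

# Relabel the bytes as ISO-8859-1 (no bytes change), then transcode to UTF-8.
# dup keeps the original untouched, because force_encoding mutates its receiver.
converted = content.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts converted.valid_encoding?  # the result is valid UTF-8 text
puts content.bytesize           # size before the conversion
puts converted.bytesize         # size after: high bytes now take two bytes each

# The conversion is reversible, so the original bytes can be recovered.
restored = converted.encode(Encoding::ISO_8859_1).b
puts restored == content
```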

Upvotes: 1
