Draco

Reputation: 337

Gsub raises "invalid byte sequence in UTF-8"

I have the following method call:

Formatting.git_log_to_html(`git log --no-merges master --pretty=full #{interval}`)

The value of interval is something like release-20130325-01..release-20130327-04.

The git_log_to_html Ruby method is the following (I am only pasting the line that raises the error):

module Formatting
  def self.git_log_to_html(git_log)
    ...
    git_log.gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit").each do |commit|
    ...
  end
end

This used to work, but I have confirmed that gsub is now raising an "invalid byte sequence in UTF-8" error.

Could you help me understand why this happens and how I can fix it? :/

Here is the output of git_log:

https://dl.dropbox.com/u/42306424/output.txt

Upvotes: 4

Views: 4476

Answers (1)

rorra

Reputation: 9693

For some reason, this command:

git log --no-merges master --pretty=full #{interval}

is returning output that is not encoded in UTF-8. Your machine may be using a different charset. Try the following:

module Formatting
  def self.git_log_to_html(git_log)
    ...
    git_log.force_encoding("UTF-8").gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit").each do |commit|
    ...
  end
end

I'm not sure if that will work, but you can try.
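To confirm the diagnosis before changing anything, you can check whether the string is valid in the encoding it is tagged with. This is a minimal sketch using a made-up sample string (a single ISO-8859-1 byte, 0xE9, inside text tagged as UTF-8), not your actual git output:

```ruby
# Made-up sample: "é" as the single ISO-8859-1 byte 0xE9, inside a string
# that Ruby treats as UTF-8. This is an assumption about what the real
# git output looks like, not a copy of it.
sample = "commit abc\nAuthor: Jos\xE9\n".dup.force_encoding("UTF-8")

# false here, because 0xE9 followed by "\n" is not a valid UTF-8 sequence.
is_valid = sample.valid_encoding?

# Running the regexp over the broken string is what triggers the error
# from the question; capture the message instead of crashing.
error_message = nil
begin
  sample.gsub(/^commit /, "COMMIT_STARTcommit")
rescue ArgumentError => e
  error_message = e.message
end
```

If valid_encoding? returns false for your git_log, relabelling it with force_encoding alone will not help, because the bytes themselves stay the same.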

If that doesn't work, you can use Ruby's iconv to convert the string to UTF-8 once you know its charset: http://www.ruby-doc.org/stdlib-2.0/libdoc/iconv/rdoc/


Based on the file you added in the comment, I ran:

require 'open-uri'
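# URI.open is the open-uri entry point on current Rubies; plain open() no longer accepts URLs since Ruby 3.0
content = URI.open('https://dl.dropbox.com/u/42306424/output.txt').read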
content.gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit")

and it worked without any problems.


By the way, you can also try:

require 'iconv'

module Formatting
  def self.git_log_to_html(git_log)
    ...
    git_log = Iconv.conv 'UTF-8', 'iso8859-1', git_log
    git_log.gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit").each do |commit|
    ...
  end
end

but you should really detect the charset of the string before attempting a conversion to utf-8.
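Since Iconv was removed from the standard library in Ruby 2.0, the same idea can be sketched with String#encode. The helper name to_utf8 and the ISO-8859-1 fallback are assumptions for illustration; the real charset of your git output may be different:

```ruby
# Hypothetical helper (not part of the original code): returns a valid
# UTF-8 copy of text. The ISO-8859-1 fallback is an assumption; replace
# it with whatever charset your git output really uses.
def to_utf8(text, fallback = "ISO-8859-1")
  utf8 = text.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?   # bytes are already valid UTF-8

  # Reinterpret the raw bytes as the fallback charset, then transcode.
  text.dup.force_encoding(fallback).encode("UTF-8")
end

latin1 = "Jos\xE9".b               # made-up sample: "José" as ISO-8859-1 bytes
converted = to_utf8(latin1)
still_valid = converted.valid_encoding?
replaced = converted.gsub(/^Jos/, "JOS")   # regexp now runs without raising
```

Inside git_log_to_html this would be used as git_log = to_utf8(git_log) before the gsub call. If the source charset cannot be determined, String#scrub (Ruby 2.1+) replaces the invalid bytes instead of converting them, at the cost of losing those characters.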

Upvotes: 3
