Tass

Reputation: 1628

Ignore � (non-UTF-8 characters) in email attachment or strip them from the attachment?

Users of our application are able to upload plain text files. These files might then be added as attachments to outgoing ActionMailer emails. Recently an attempt to send such an email resulted in an "invalid byte sequence in UTF-8" error, and the email was not sent. This symbol, �, appears throughout the offending attachment.

We're using ActionMailer so although it ought to go without saying, here's representative code for the attachment action within the mailer class's method:

attachments['file-name.jpg'] = File.read('file-name.jpg')

From a business standpoint we don't care about the content of these text files. Ideally I'd like for our application to ignore the content and simply attach them to emails.

Is it possible to somehow tell Rails / ActionMailer to ignore the formatting? Or should I parse the incoming text file, stripping out non-UTF-8 characters?

I did search through like questions here on Stack Overflow but nothing addressed the problem I'm currently facing.

Edit: I did call #readlines on the file in a Rails console and found that the black diamond is a representation of \xA0. This is likely a non-breaking space in Latin1 (ISO 8859-1).
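A quick way to check that reading of the byte, as a minimal sketch: the sample string below is made up, and it assumes the offending byte really is a lone 0xA0.

```ruby
# Hypothetical sample: "caf", a lone 0xA0 byte, then "au lait".
raw = "caf\xA0au lait".b

# Tagged as UTF-8, the byte is an orphaned continuation byte, so the string is invalid.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
utf8_valid = as_utf8.valid_encoding?

# Tagged as ISO-8859-1, the same byte is a non-breaking space and transcodes cleanly.
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)
converted = as_latin1.encode(Encoding::UTF_8)

puts utf8_valid
puts converted.include?("\u00A0")
```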

Upvotes: 1

Views: 1775

Answers (3)

Tass

Reputation: 1628

When reading the file at time of attachment, I can use the following syntax.

mail.attachments[file.file_name.to_s] = File.read(path_to_file).force_encoding("BINARY").gsub(0xA0.chr,"")

The important addition is the following, which goes after the call to File.read(...):

.force_encoding("BINARY").gsub(0xA0.chr,"")

The stripping and encoding ought to be done at time of file upload to our system, so this answer isn't the resolution. It's a short-term band-aid.
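For what the upload-time fix might look like, here is a rough sketch. The helper name normalize_upload is hypothetical, and falling back to ISO-8859-1 is an assumption based on the \xA0 finding in the question, not something the application is known to do.

```ruby
# Hypothetical helper: returns valid UTF-8 text for an uploaded file's raw bytes.
# Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
def normalize_upload(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Every byte is defined in ISO-8859-1, so this transcoding cannot raise.
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

cleaned = normalize_upload("a\xA0b".b)
puts cleaned.valid_encoding?
```

Unlike the gsub band-aid, this keeps the non-breaking space instead of deleting it, which may or may not be what you want.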

Upvotes: 0

AnoE

Reputation: 8345

With your edit, this seems pretty clear to me:

  1. The file on your filesystem is encoded in latin1.
  2. File.read uses Ruby's default external encoding (Encoding.default_external), which is derived from the locale. If LANG contains something like "en_GB.utf8", File.read will tag the string as UTF-8. You can verify this by logging the value of str.encoding (where str is the value of File.read).
  3. File.read does not actually verify the encoding, it only slurps in the bytes and slaps on the encoding (like force_encoding).
  4. Later, in ActionMailer, something wants to transcode the string, for whatever reason, and that fails as expected (and with the result you are noticing).
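Points 2 and 3 can be reproduced with a throwaway file. This is only a sketch: the file content is made up, and the explicit encoding: option stands in for whatever default a UTF-8 locale would give you.

```ruby
require "tempfile"

tagged_encoding = nil
tagged_valid = nil

Tempfile.create("latin1") do |f|
  f.binmode
  f.write("a\xA0b".b)   # hypothetical Latin-1 content with a lone 0xA0 byte
  f.flush

  # Explicit UTF-8 here mimics Encoding.default_external under a UTF-8 locale.
  str = File.read(f.path, encoding: Encoding::UTF_8)
  tagged_encoding = str.encoding        # the string is tagged as UTF-8...
  tagged_valid = str.valid_encoding?    # ...but the bytes were never checked
end

puts tagged_encoding
puts tagged_valid
```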

If your text files are encoded in latin1, then use File.read(path, encoding: Encoding::ISO_8859_1). This way, it may work. Let us know if it doesn't...
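A minimal sketch of that suggestion, with a made-up temp file standing in for the real upload. The final encode call is an extra step not mentioned above; it is only needed if you want a UTF-8 string rather than a Latin-1 one.

```ruby
require "tempfile"

utf8_text = nil

Tempfile.create("latin1") do |f|
  f.binmode
  f.write("price:\xA042".b)   # hypothetical Latin-1 content: "price:", NBSP, "42"
  f.flush

  # Tag the bytes with the encoding the file is actually in.
  latin1 = File.read(f.path, encoding: Encoding::ISO_8859_1)

  # Optional: transcode so the rest of the app sees valid UTF-8.
  utf8_text = latin1.encode(Encoding::UTF_8)
end

puts utf8_text.valid_encoding?
```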

Upvotes: 0

Azolo

Reputation: 4383

If Ruby is having problems reading the file and is corrupting the characters during the read, then try using File.binread. File.binread is inherited from IO.

...
  attachments['attachment.txt'] = File.binread('/path/to/file')
...
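To see what File.binread hands back, here is a small sketch using a throwaway file with made-up content:

```ruby
require "tempfile"

binread_encoding = nil
binread_bytes = nil

Tempfile.create("upload") do |f|
  f.binmode
  f.write("a\xA0b".b)   # hypothetical content: "a", a lone 0xA0 byte, "b"
  f.flush

  data = File.binread(f.path)
  binread_encoding = data.encoding   # ASCII-8BIT, also known as BINARY
  binread_bytes = data.bytes         # the bytes come back untouched
end

puts binread_encoding
p binread_bytes
```

Because the string is tagged as binary, Ruby never tries to interpret 0xA0 as part of a UTF-8 sequence.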

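If your file already has corrupted characters then you can either find some process to 'uncorrupt' them, which is not fun, or get rid of them by re-encoding from ASCII-8BIT to UTF-8. With invalid: :replace and undef: :replace, each byte that can't be converted is replaced with the Unicode replacement character (U+FFFD); add replace: '' to strip such bytes out instead.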

...
  attachments['attachment.txt'] = File.binread('/path/to/file')
    .encode('utf-8', 'binary', invalid: :replace, undef: :replace)
...

(String#scrub does this, but since you can't read the file in as UTF-8, you can't use it.)
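The difference between replacing and stripping during the re-encode can be seen on an in-memory string. This is a sketch with a made-up sample; the replace: option belongs to String#encode but was not part of the code above.

```ruby
# Hypothetical sample: "a", a lone 0xA0 byte, "b", tagged as binary.
raw = "a\xA0b".b

# Default replacement when the target is UTF-8 is U+FFFD.
replaced = raw.encode("UTF-8", "BINARY", invalid: :replace, undef: :replace)

# An empty replacement string drops the unconvertible bytes entirely.
stripped = raw.encode("UTF-8", "BINARY", invalid: :replace, undef: :replace, replace: "")

puts replaced.valid_encoding?
puts stripped
```

Note that every byte above 0x7F is treated as unconvertible here, so legitimate accented characters in a Latin-1 file would be replaced or dropped as well.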

Upvotes: 2
