Reputation: 10363
I started using mime to parse email and extract attachments. Whatever I did, the binary attachment always got corrupted when I wrote it to disk. Then I realized that, for some weird reason, all the base64 attachments are already decoded when the message is parsed into data types. That's where my problem starts.
If it is an image, it does not work.
The first thing I did was to convert the extracted Text attachment to ByteString with TE.encodeUtf8. No luck. I tried all the Text.Encoding functions to convert Text to ByteString - nothing works.
Then for some stupid reason I converted/encoded the extracted text back to base64, then again I decoded it from base64 and it worked this time. Why?
So if I encode the extracted attachment to base64 and decode it back, it works.
B.writeFile "tmp/test.jpg" $ B.pack $ decode $ encodeRawString True $ T.unpack attachment
Why? Why does a simple encoding of Text to ByteString not work, while the above silliness does?
Eventually, I played with it a bit more and got to the point where it works with Data.ByteString.Char8, like this: B.writeFile "tmp/test.jpg" $ BC.pack $ T.unpack attachment
So I still have to convert Text to String, then String to ByteString.Char8, and only then does it work and give me an uncorrupted image.
Can someone please explain all this? Why such a pain with a binary image attachment? Why can't I convert base64-decoded Text to ByteString? What am I missing?
Thank you.
UPDATE
This is the code to extract the attachment as requested. I thought it was not relevant to text encoding/decoding.
{-# LANGUAGE OverloadedStrings #-}

import Codec.MIME.Parse
import Codec.MIME.Type
import Data.Maybe
import Data.Text (Text, unpack, strip)
import qualified Data.Text as T (null)
import Data.Text.Encoding (encodeUtf8)
import Data.ByteString (ByteString)

data Attachment = Attachment { attName :: Text
                             , attSize :: Int
                             , attBody :: Text
                             } deriving (Show)

-- Walk the MIME tree and collect every part marked as an attachment.
genAttach :: Text -> [Attachment]
genAttach m =
  let prs v = if isAttach v
                then [Just (mkAttach v)]
                else case mime_val_content v of
                       Single c -> if T.null c
                                     then [Nothing]
                                     else prs $ parseMIMEMessage c
                       Multi vs -> concatMap prs vs
  in let atts = filter isJust $ prs $ parseMIMEMessage m
     in if null atts then [] else map fromJust atts

-- A part is an attachment if its Content-Disposition says so.
isAttach :: MIMEValue -> Bool
isAttach mv =
  maybe False check $ mime_val_disp mv
  where check d = if (dispType d) == DispAttachment then True else False

-- Build an Attachment from the disposition parameters and the part body.
mkAttach :: MIMEValue -> Attachment
mkAttach v =
  let prms = dispParams $ fromJust $ mime_val_disp v
      Single cont = mime_val_content v
      name = check . filter isFn
        where isFn (Filename _) = True
              isFn _ = False
              check = maybe "" (\(Filename n) -> n) . listToMaybe
      size = check . filter isSz
        where isSz (Size _) = True
              isSz _ = False
              check = maybe "" (\(Size n) -> n) . listToMaybe
  in Attachment { attName = name prms
                , attSize = let s = size prms
                            in if T.null s then 0 else read $ unpack s
                , attBody = cont
                }
Upvotes: 5
Views: 2006
Reputation: 1896
The mime package uses Text throughout, but undoes any base64 (or quoted-printable) encoding for the bodies of message parts according to their Content-Transfer-Encoding MIME header field.
What this means is that the body of the resulting message part is of type Text, but (unless the MIME type is text/*) this is not an appropriate type, as the body should really be a sequence of bytes, not characters. The characters that it uses are Unicode code points 00 to FF, which have an obvious mapping onto bytes, but they are not the same thing. (Furthermore, if the MIME type is text/* but the charset is not us-ascii or iso8859-1, then I think mime will garble the content.)
What I suspect was happening is that when you wrote the Text out to disk you were using Data.Text.IO.writeFile or similar, which uses the character encoding specified in your environment to convert the characters to bytes. Many common character encodings map characters 00 to 7F onto bytes 00 to 7F, but hardly any map the remaining characters 80 to FF onto their corresponding bytes. On many systems these days the environment's encoding is UTF-8, which doesn't even map those characters onto single bytes. (Were the file sizes different from what you expected?)
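To make the size difference concrete, here is a minimal sketch that uses only the text and bytestring packages. The `sample` value is a hypothetical stand-in for a decoded attachment body (it is not produced by mime here), and the names `utf8Bytes` and `byteForByte` are made up for the illustration.

```haskell
import Data.Char (chr)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical stand-in for a decoded attachment body: the bytes
-- FF D8 FF E0 held as the code points U+00FF, U+00D8, U+00FF, U+00E0.
sample :: T.Text
sample = T.pack (map chr [0xFF, 0xD8, 0xFF, 0xE0])

-- UTF-8 turns every code point above U+007F into at least two bytes,
-- so the output is longer than the original byte sequence.
utf8Bytes :: [Word8]
utf8Bytes = B.unpack (TE.encodeUtf8 sample)

-- Char8.pack keeps only the low 8 bits of each Char: one byte per character.
byteForByte :: [Word8]
byteForByte = B.unpack (BC.pack (T.unpack sample))
```

Here `utf8Bytes` has eight elements while `byteForByte` has the original four, which is the kind of size mismatch the question in parentheses above is hinting at.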
In order to write it out correctly you need to convert these characters to the right bytes first. The simplest way to do this is to use the functions in the Data.ByteString.Char8 module, which are designed to confuse bytes and characters in exactly this way. But they work on String values and not on Text values, so you have to unpack and repack everything.
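If the intermediate String is a concern, one possible alternative is to build the ByteString directly from the Text. This is only a sketch: `textToBytes` is a hypothetical helper, not part of mime, text, or bytestring, and it truncates characters above U+00FF to their low 8 bits in the same way that Data.ByteString.Char8.pack does.

```haskell
import Data.Char (ord)
import qualified Data.ByteString as B
import qualified Data.Text as T

-- Map each character to the byte with the same number, without building
-- an intermediate String. Characters above U+00FF are truncated to their
-- low 8 bits (fromIntegral to Word8 wraps modulo 256).
textToBytes :: T.Text -> B.ByteString
textToBytes t = fst (B.unfoldrN (T.length t) step t)
  where
    -- Peel one character off the front and emit it as a single byte.
    step rest = case T.uncons rest of
      Nothing     -> Nothing
      Just (c, r) -> Just (fromIntegral (ord c), r)
```

With this helper the write becomes `B.writeFile "tmp/test.jpg" (textToBytes attachment)`, where `attachment` is the Text from the question.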
I'm not sure how you managed to base-64 encode your Text value, as base-64 encoding also operates on bytes and not characters. However you did it, you must have managed to map characters 80 to FF onto the corresponding bytes directly rather than encoding them in any way.
If you want to know more, there is a good article on this here: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Upvotes: 0
Reputation: 52039
Note that the mime package chooses to represent binary content with a Text value. The way to derive the corresponding ByteString is to latin1 encode the text. In this case it is guaranteed that all of the code points in the text string will be in the range 0 - 255.
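For Text that does not come from a decoded binary part, that guarantee may not hold, so a defensive variant can check the range first. The following is a minimal sketch; `latin1Encode` is a hypothetical name, not a function provided by text or mime.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- Latin-1 encode a Text, refusing input that contains any code point
-- above U+00FF instead of silently truncating it.
latin1Encode :: T.Text -> Maybe B.ByteString
latin1Encode t
  | T.all (<= '\xFF') t = Just (BC.pack (T.unpack t))
  | otherwise           = Nothing
```

A `Nothing` result signals that the value was real text rather than bytes in disguise, in which case a proper character encoding such as UTF-8 is the appropriate choice.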
Create a file with this content:
Content-Type: image/gif
Content-Transfer-Encoding: base64

R0lGODlhAQABAIABAP8AAP///yH5BAEAAAEALAAAAAABAAEAAAICRAEAOw==
This is the base64 encoding of the 1x1 red GIF image at http://commons.wikimedia.org/wiki/File:1x1.GIF
Here is some code which uses parseMIMEMessage to recreate this file.
import Codec.MIME.Parse
import Codec.MIME.Type
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Data.ByteString.Char8 as BS
import System.IO

-- Write the Text through a handle whose encoding is set to latin1.
test1 path = do
  msg <- TIO.readFile path
  let mval = parseMIMEMessage msg
      Single img = mime_val_content mval
  withBinaryFile "out-io" WriteMode $ \h -> do
    hSetEncoding h latin1
    TIO.hPutStr h img

-- Convert the Text to a ByteString first, then write the bytes.
test2 path = do
  msg <- TIO.readFile path
  let mval = parseMIMEMessage msg
      Single img = mime_val_content mval
      bytes = BS.pack $ T.unpack img
  BS.writeFile "out-bs" bytes
In test2 the latin1 encoding is accomplished with BS.pack . T.unpack.
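A quick way to convince yourself that the two approaches agree is to run both on the same Text and compare the files byte for byte. The sketch below does that without the mime package; `writeLatin1`, `writePacked`, and `sameOutput` are hypothetical helper names, and the output file names are simply reused from the code above.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO

-- Same idea as test1: let the handle do the latin1 encoding.
writeLatin1 :: FilePath -> T.Text -> IO ()
writeLatin1 path body =
  withBinaryFile path WriteMode $ \h -> do
    hSetEncoding h latin1
    TIO.hPutStr h body

-- Same idea as test2: convert to bytes first, then write them.
writePacked :: FilePath -> T.Text -> IO ()
writePacked path = BC.writeFile path . BC.pack . T.unpack

-- Write the body both ways and compare the resulting files.
sameOutput :: T.Text -> IO Bool
sameOutput body = do
  writeLatin1 "out-io" body
  writePacked "out-bs" body
  (==) <$> B.readFile "out-io" <*> B.readFile "out-bs"
```

Calling `sameOutput` on a Text containing every code point from 0 to 255 is a reasonable smoke test, since that covers each byte value an attachment could contain.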
Upvotes: 2