How to parse encoded HTML

Question

I'm working on a digest email to send to users of my company's app. For this I'm going through each user's emails and trying to find some basic information about each email (from, subject, timestamp, and, the aspect that's causing me difficulty, an image).

I assumed Nokogiri's search('img') function would be fine to pull out images. Unfortunately it looks like most emails have a lot of garbage embedded in the URLs of those images, like newlines ("\n"), escape characters ("\"), and the string "3D" for some reason. For example:



This is causing the search to only pull out pieces of the actual URLs/src's:

#(Element:0x3fd0c8e83b80 {
  name = "img",
  attributes = [
    #(Attr:0x3fd0c8e82a28 { name = "src", value = "3D%22https://=" }),
    #(Attr:0x3fd0c8e82a14 { name = "d3ui957tjb5bqd.cloudfront.net", value = "" }),
    #(Attr:0x3fd0c8e82a00 { name = "width", value = "3D"223"" }),
    #(Attr:0x3fd0c8e829ec { name = "heigh", value = "t=3D"84"" }),
    #(Attr:0x3fd0c8e829d8 { name = "alt", value = "3D"Creative" }),
    #(Attr:0x3fd0c8e829c4 { name = "market", value = "" }),
    #(Attr:0x3fd0c8e829b0 { name = "border", value = "3D"0"" })]
  }) 


Does anyone have an idea why this is happening, and how to remove all this junk? 

I'm getting decent results from lots of gsub calls and safety checks, but it feels pretty tacky.

I've also tried Sanitize.clean which doesn't work and the PermitScrubber mentioned in "How to sanitize html string except image url?".

matt · Accepted Answer

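The mail body is encoded as quoted-printable. In that encoding "=3D" is an escaped "=" and a "=" at the end of a line is a soft line break, which is why the attributes come out mangled. You will need to decode the body before you parse it with Nokogiri. You can do this fairly easily with Ruby using unpack: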

decoded = encoded.unpack('M').first
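
To see what the decoding step does, here is a minimal sketch on a made-up fragment. The markup and the example.com URL are assumptions for illustration, not taken from the question's email; only unpack('M') comes from the answer.

```ruby
# Hypothetical quoted-printable fragment, shaped like the markup in the question.
# "=3D" is an encoded "=", and a "=" right before a newline is a soft line break.
encoded = "<img src=3D\"https://=\nexample.com/logo.png\" width=3D\"223\" heigh=\nt=3D\"84\">"

# The 'M' directive decodes quoted-printable; unpack returns an array,
# so take the first element to get the decoded string.
decoded = encoded.unpack('M').first

puts decoded
```

The decoded string is ordinary HTML again, so it can be handed to Nokogiri and search('img') should see a single intact src attribute.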

You should check what the encoding is by looking at the mail headers (Content-Transfer-Encoding) before trying to decode. Not all mail is encoded this way, and there are other types of encoding.
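
One way to act on that advice is to branch on the header value before decoding. The sketch below is only an assumption about how this might be wired up: decode_body is a hypothetical helper, and it expects you to have already extracted the Content-Transfer-Encoding value as a string from whatever mail library you use.

```ruby
# Hypothetical helper: decode a raw body according to the value of its
# Content-Transfer-Encoding header (passed in as a string, possibly nil).
def decode_body(raw_body, transfer_encoding)
  case transfer_encoding.to_s.strip.downcase
  when 'quoted-printable'
    raw_body.unpack('M').first   # 'M' decodes quoted-printable
  when 'base64'
    raw_body.unpack('m').first   # 'm' decodes base64
  else
    # 7bit, 8bit, binary, or a missing header: return the body unchanged.
    raw_body
  end
end

puts decode_body('a=3Db', 'Quoted-Printable')
puts decode_body('aGVsbG8=', 'base64')
puts decode_body('a=3Db', '7bit')
```

Bodies that are not quoted-printable or base64 pass through untouched, so a plain 7bit message is not corrupted by an unnecessary decode.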
