Reputation: 665
I'm working on a digest email to send to users of my company's app. For this I'm going through each user's emails and trying to find some basic information about each email (from, subject, timestamp, and, the aspect that's causing me difficulty, an image).
I assumed Nokogiri's search('img') function would be fine to pull out images. Unfortunately it looks like most emails have a lot of garbage embedded in the URLs of those images, like newlines ("\n"), escape characters ("\"), and the string "3D" for some reason. For example:
<img src=3D\"https://=\r\nd3ui957tjb5bqd.cloudfront.net/images/emails/1/logo.png\"
This is causing the search to only pull out pieces of the actual URLs/src's:
#(Element:0x3fd0c8e83b80 {
name = "img",
attributes = [
#(Attr:0x3fd0c8e82a28 { name = "src", value = "3D%22https://=" }),
#(Attr:0x3fd0c8e82a14 { name = "d3ui957tjb5bqd.cloudfront.net", value = "" }),
#(Attr:0x3fd0c8e82a00 { name = "width", value = "3D\"223\"" }),
#(Attr:0x3fd0c8e829ec { name = "heigh", value = "t=3D\"84\"" }),
#(Attr:0x3fd0c8e829d8 { name = "alt", value = "3D\"Creative" }),
#(Attr:0x3fd0c8e829c4 { name = "market", value = "" }),
#(Attr:0x3fd0c8e829b0 { name = "border", value = "3D\"0\"" })]
})
Does anyone have an idea why this is happening, and how to remove all this junk?
I'm getting decent results from lots of gsub's and safety checks but it feels pretty tacky.
I've also tried Sanitize.clean, which doesn't work, and the PermitScrubber mentioned in "How to sanitize html string except image url?".
Upvotes: 0
Views: 485
Reputation: 6321
I'm no expert at scraping, but you can select the element with a CSS selector and read its src attribute: .at_css("img")['src']
For example:
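require "open-uri"
require "nokogiri"
# url_link is the address of the page to scrape
doc = URI.open(url_link) # Kernel#open no longer accepts URLs as of Ruby 3.0
page = Nokogiri::HTML(doc)
page.css("div.col-xs-12.visible-xs.visible-sm div.school-image").each do |pic|
  img_tag = pic.at_css("img")
  # Skip containers that have no <img> child
  img = img_tag['src'].downcase if img_tag
end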
Upvotes: 1
Reputation: 79743
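The mail body is encoded as quoted-printable. In that encoding "=3D" stands for a literal "=", and a "=" at the end of a line is a soft line break, which is why those sequences show up in the middle of your URLs. You will need to decode the body before you parse it with Nokogiri. You can do this fairly easily in Ruby using unpack: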
decoded = encoded.unpack('M').first
You should check what the encoding is by looking at the mail headers before trying to decode. Not all mail is encoded this way, and there are other types of encoding.
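As a rough sketch of that check, something like the following could work. The helper name decode_body is hypothetical, and it assumes you have already pulled the Content-Transfer-Encoding header value out of the message yourself; it only covers quoted-printable and base64 and passes everything else through untouched. The sample string is the img tag from the question.

```ruby
# Hypothetical helper: decode a mail body according to the value of its
# Content-Transfer-Encoding header (assumed to be extracted by the caller).
def decode_body(body, transfer_encoding)
  case transfer_encoding.to_s.strip.downcase
  when "quoted-printable"
    body.unpack1("M") # "=3D" becomes "=", "=\r\n" soft line breaks are dropped
  when "base64"
    body.unpack1("m")
  else
    body              # 7bit, 8bit, binary: nothing to decode
  end
end

# The fragment from the question, written as a Ruby string literal
encoded = "<img src=3D\"https://=\r\nd3ui957tjb5bqd.cloudfront.net/images/emails/1/logo.png\" width=3D\"223\">"
decoded = decode_body(encoded, "quoted-printable")
puts decoded
```

The decoded string can then be handed to Nokogiri::HTML and searched for img elements as in the question. Note that unpack returns a binary (ASCII-8BIT) string, so you may need force_encoding with the charset named in the Content-Type header if the body contains non-ASCII text.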
Upvotes: 3