Reputation: 395
I built a scraper to pull all the information out of a Wikipedia table and upload it to my database. All was good until I realized I was pulling the wrong URL for images: I wanted the actual image URL, "http://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg", and not the "/wiki/File:Baconbutty.jpg" link it was giving me. Here is my code so far:
    def initialize
      @url = "http://en.wikipedia.org/wiki/List_of_sandwiches"
      @nodes = Nokogiri::HTML(open(@url))
    end

    def summary
      sammich_data = @nodes
      sammiches = sammich_data.css('div.mw-content-ltr table.wikitable tr')
      sammich_data.search('sup').remove
      sammich_hashes = sammiches.map {|x|
        if content = x.css('td')[0]
          name = content.text
        end
        if content = x.css('td a.image').map {|link| link['href']}
          image = content[0]
        end
        if content = x.css('td')[2]
          origin = content.text
        end
        if content = x.css('td')[3]
          description = content.text
        end
My issue is with this line:
    if content = x.css('td a.image').map {|link| link['href']}
      image = content[0]
If I go to td a.image img, it just gives me a null entry.
Any suggestions?
Upvotes: 0
Views: 258
Reputation: 160551
Here's how I'd do it (if I were to scrape Wikipedia, which I wouldn't, because they have an API for this sort of thing):
    require 'nokogiri'
    require 'open-uri'
    require 'pp'

    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
    doc = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/List_of_sandwiches"))

    sammich_hashes = doc.css('table.wikitable tr').map { |tr|
      # Destructure the row's cells; any missing cell becomes nil.
      name, image, origin, description = tr.css('td,th')
      name, origin, description = [name, origin, description].map { |n| n && n.text ? n.text : nil }
      # Fall back to nil when the cell is missing or contains no <img>.
      image = image.at('img')['src'] rescue nil
      {
        name: name,
        origin: origin,
        description: description,
        image: image
      }
    }

    pp sammich_hashes
Which outputs:
[
{:name=>"Name", :origin=>"Origin", :description=>"Description", :image=>nil},
{
:name=>"Bacon",
:origin=>"United Kingdom",
:description=>"Often served with ketchup or brown sauce",
:image=>"//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg/120px-Baconbutty.jpg"
},
... [lots removed] ...
{
:name=>"Zapiekanka",
:origin=>"Poland",
:description=>"A halved baguette or other bread usually topped with mushrooms and cheese, ham or other meats, and vegetables",
:image=>"//upload.wikimedia.org/wikipedia/commons/thumb/1/12/Zapiekanka_3..jpg/120px-Zapiekanka_3..jpg"
}
]
If an image isn't available, the field will be set to nil in the returned hashes.
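Note that the src values in that output are protocol-relative (they start with "//") and point at the 120px thumbnail. Here is a minimal sketch of turning one into the absolute URL form quoted in the question. The helper names and PAGE_URL constant are hypothetical, and the "/<width>px-<filename>" suffix is an assumption based only on the two URLs shown above, so treat it as a starting point rather than a guaranteed rule:

```ruby
require 'uri'

# Page the src values were scraped from; used as the base for resolution.
PAGE_URL = "http://en.wikipedia.org/wiki/List_of_sandwiches"

# Hypothetical helper: resolve a (possibly protocol-relative) src against the page URL.
def absolute_image_url(src, base = PAGE_URL)
  return nil if src.nil?
  URI.join(base, src).to_s
end

# Hypothetical helper. Assumption: thumbnail URLs end in "/<width>px-<filename>".
# URLs that do not match are returned unchanged.
def strip_thumbnail_size(url)
  return nil if url.nil?
  url.sub(%r{/\d+px-[^/]+\z}, '')
end

src = "//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg/120px-Baconbutty.jpg"
full = absolute_image_url(src)
puts full
puts strip_thumbnail_size(full)
```

In the scraper this would be applied to the image value before building each hash. Rows without an image pass through as nil.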
Upvotes: 1
Reputation: 1258
You could use the srcset attribute of the img element, split it, and keep one of the available resized images.
    if content = x.at_css('td a.image img')
      image = content['srcset'].split(' 1.5x,').first
    end
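One caveat with splitting on the literal ' 1.5x,': it raises NoMethodError when an img has no srcset, and it returns the whole string when no 1.5x candidate is present. A slightly more defensive sketch is to parse the attribute into a hash keyed by descriptor. The parse_srcset helper is hypothetical, the sample string is modelled on the URLs in this thread rather than fetched, and it assumes candidates are separated by a comma followed by whitespace:

```ruby
# Hypothetical helper: turn a srcset string into { descriptor => url }.
# Assumes each candidate has the form "url descriptor" (e.g. "1.5x", "2x")
# and that candidates are separated by a comma followed by whitespace.
def parse_srcset(srcset)
  return {} if srcset.nil?
  srcset.split(/,\s+/).each_with_object({}) do |candidate, result|
    url, descriptor = candidate.strip.split(/\s+/, 2)
    # A candidate without a descriptor is treated as the 1x image.
    result[descriptor || '1x'] = url unless url.nil? || url.empty?
  end
end

# Sample value (assumed for illustration, not taken from a live page).
srcset = "//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg/180px-Baconbutty.jpg 1.5x, " \
         "//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg/240px-Baconbutty.jpg 2x"

candidates = parse_srcset(srcset)
puts candidates['2x']
```

With Nokogiri, the input would be content['srcset'], and an element without the attribute simply yields an empty hash.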
Upvotes: 0