Max Powah

Reputation: 21

Nokogiri scraping with Ruby on Rails not working as expected

I'm completely new to Ruby on Rails, so I may be missing something obvious. I'm working on a web app that scrapes auction websites. The bones of the app were created by someone else. I'm now trying to add scrapers for new websites, but they don't seem to be working.

I have read through some of the Nokogiri documentation, confirmed that the scraped information is not being written to the database (the seeded URLs being targeted are present when I check via the Rails console), and used the Chrome extension CSS Selector Tester to check that I am targeting the correct CSS selectors. The record IDs are also correct when I check via the Rails console.

I have put what I think are the important sections of code below, but I might be missing something that I don't realise is important.

The websites I'm having issues with are Lot-art.com and Lot-tissimo.com.

Any help will be much appreciated.

Seeded URLs

Source.create(name: "Auction.fr", query_template: "https://www.auction.fr/_en/lot/search/?contexte=futures&tri=date_debut%20ASC&query={query}&page={page}")
Source.create(name: "Invaluable.co.uk", query_template: "https://www.invaluable.co.uk/search/api/search-results?keyword={query}&size=1000")
Source.create(name: "Interencheres.com", query_template: "http://www.interencheres.com/en/recherche/lot?search%5Bkeyword%5D={query}&page={page}")
Source.create(name: "Gazette-drouot.com", query_template: "http://catalogue.gazette-drouot.com/html/g/recherche.jsp?numPage={page}&filterDate=1&query={query}&npp=100")
Source.create(name: "Lot-art.com", query_template: "http://www.lot-art.com/auction-search/?form_id=lot_search_form&page=1&mq=&q={query}&ord=recent")
Source.create(name: "Lot-tissimo.com", query_template: "https://lot-tissimo.com/en/cmd=s&lwr=&ww={query}&xw=&srt=SN&wg=EUR&page={page}")

Scheduler code

require 'rufus-scheduler'

require 'nokogiri'
require 'mechanize'
require 'open-uri'
require "net/https"


s = Rufus::Scheduler.singleton


s.interval '1m' do
  setting = Setting.find(1)
  agent = Mechanize.new

  agent.user_agent_alias = 'Windows Chrome'

  agent.cookie_jar.load(File.join(Rails.root, 'tmp/cookies.yaml'))
  List.all.each do |list|
    number_of_new_items = 0

    list.actions.each do |action|
      url = action.source.query_template.gsub('{query}', action.list.query)

      case action.source.id
      when 1 # Auction.fr
        20.downto(1) do |page|
          doc = Nokogiri::HTML(open(url.gsub('{page}', page.to_s)))

          doc.css("div.list-products > ul > li").reverse.each do |item_data|

            price = 0
            if item_data.at_css("h3.h4.adjucation.ft-blue") && /Selling price : ([\d\s]+) €/.match(item_data.at_css("h3.h4.adjucation.ft-blue").text)
              price = /Selling price : ([\d\s]+) €/.match(item_data.at_css("h3.h4.adjucation.ft-blue").text)[1].gsub(" ", "")
            end

            item = action.items.new(
              title: item_data.at_css("h2").text.strip,
              url: item_data.at_css("h2 a")["href"],
              picture: item_data.at_css("div.image-wrap.lazy div.image img")["src"],
              price: price,
              currency: "€"
            )

            ActiveRecord::Base.logger.silence do # This disables writing logs
              if item.save
                number_of_new_items = number_of_new_items + 1
              end
            end

          end
        end

      when 97 # Lot-Tissimo.com
        5.downto(1) do |page|
          doc = Nokogiri::HTML(open(url.gsub('{page}', page.to_s)))

          doc.css("#inhalt > .objektliste").reverse.each do |item_data|

            # Price parsing is not implemented for this source yet; default to 0
            # so that the `price` local referenced below is defined.
            price = 0

            item = action.items.new(
              title: item_data.at_css("div.objli-desc").text.strip,
              url: item_data.at_css("td.objektliste-foto a")["href"],
              picture: item_data.at_css("td.objektliste-foto a#lot_link img")["src"],
              price: price,
              currency: "€"
            )

            ActiveRecord::Base.logger.silence do # This disables writing logs
              if item.save
                number_of_new_items = number_of_new_items + 1
              end
            end


          end
        end

      when 2 # Invaluable.co.uk
        doc = JSON.parse(open(url).read)

        doc["itemViewList"].reverse.each do |item_data|

          puts item_data["itemView"]["photos"]

          item = action.items.new(
            title: item_data["itemView"]["title"],
            url: "https://www.invaluable.co.uk/buy-now/" + item_data["itemView"]["title"].parameterize + "-" + item_data["itemView"]["ref"],
            picture: item_data["itemView"]["photos"] != nil ? item_data["itemView"]["photos"].first["_links"]["medium"]["href"] : nil,
            price: item_data["itemView"]["price"],
            currency: item_data["itemView"]["currencySymbol"]
          )

          ActiveRecord::Base.logger.silence do # This disables writing logs
            if item.save
              number_of_new_items = number_of_new_items + 1
            end
          end

        end



      when 3 # Interencheres.com
        # doc = Nokogiri::HTML(open(url))
        5.downto(1) do |page|
          doc = Nokogiri::HTML(open(url.gsub('{page}', page.to_s)))

          doc.css("div#lots_0 div.ligne_vente").reverse.each do |item_data|

            price = 0

            item = action.items.new(
              title: item_data.at_css("div.ph_vente div.des_vente p a").text.strip,
              url: "http://www.interencheres.com" + item_data.at_css("div.ph_vente div.des_vente p a")["href"],
              picture: item_data.at_css("div.ph_vente div.gd_ph_vente img")["src"],
              price: price,
              currency: "€"
            )

            ActiveRecord::Base.logger.silence do # This disables writing logs
              if item.save
                number_of_new_items = number_of_new_items + 1
              end
            end

          end
        end

      when 4 # Gazette-drouot.com
        5.downto(1) do |page|
          # doc = Nokogiri::HTML(open(url.gsub('{page}', page.to_s)))
          doc = agent.get(url.gsub('{page}', page.to_s))
          # doc = agent.get(url)
          doc.css("div#recherche_resultats div.lot_recherche").reverse.each do |item_data|

            price = 0

            picture = item_data.at_css("img.image_thumb_recherche") ? item_data.at_css("img.image_thumb_recherche")["src"] : nil
            item = action.items.new(
              title: item_data.at_css("#des_recherche").text.strip.truncate(140),
              url: "http://catalogue.gazette-drouot.com/html/g/" + item_data.at_css("a.lien_under")["href"],
              picture: picture,
              price: price,
              currency: "€"
            )

            ActiveRecord::Base.logger.silence do # This disables writing logs
              if item.save
                number_of_new_items = number_of_new_items + 1
              end
            end
          end

        end

      when 69 # Lot-art.com

        doc = agent.get(url)
        doc.css("div.lot_list_holder").reverse.each do |item_data|

          price = 0

          item = action.items.new(
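            # at_css returns a single node, so no [0] index is needed; on a node, [0] would look up an attribute named "0" and return nil.
            title: item_data.at_css("div.lot_list_body a").text.strip.truncate(140),
            # Assumption: the href sits on the link inside div.lot_list_body, not on the div itself.
            url: item_data.at_css("div.lot_list_body a")["href"],
            picture: item_data.at_css("a.lot_list_thumb img")["src"],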
            price: price,
            currency: "€"
          )

          ActiveRecord::Base.logger.silence do # This disables writing logs
            if item.save
              number_of_new_items = number_of_new_items + 1
            end
          end


        end

      end

    end

    if number_of_new_items > 0 && setting.notifications_per_hour > setting.notifications_this_hour && setting.pushover_app_token.present? && setting.pushover_user_key.present?
      url = URI.parse("https://api.pushover.net/1/messages.json")
      req = Net::HTTP::Post.new(url.path)
      req.set_form_data({
                          :token => setting.pushover_app_token,
                          :user => setting.pushover_user_key,
                          :message => "#{number_of_new_items} new items on #{list.name}!",
                          :url_title => "Check the list",
                          :url => "http://spottheauction.com/lists/#{list.id}"
      })
      res = Net::HTTP.new(url.host, url.port)
      res.use_ssl = true
      res.verify_mode = OpenSSL::SSL::VERIFY_PEER
      res.start {|http| http.request(req) }
    end
  end
  agent.cookie_jar.save(File.join(Rails.root, 'tmp/cookies.yaml'))
end

s.cron '0 * * * *' do
  setting = Setting.find(1)
  setting.notifications_this_hour = 0
  setting.save
end
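
One thing that may be worth ruling out in the scheduler above: the search term is substituted into query_template unencoded, so a query containing spaces or an ampersand could produce a URL the site rejects or misreads. A minimal sketch of an encoding step follows. The helper name build_search_url is hypothetical and not part of the app, and whether each site expects form-style encoding is an assumption.

```ruby
require 'uri'

# Hypothetical helper (not in the original app): fills a query_template,
# percent-encoding the search term before substituting it.
def build_search_url(template, query, page = nil)
  encoded = URI.encode_www_form_component(query)
  # Block form of gsub so the replacement is inserted literally.
  url = template.gsub('{query}') { encoded }
  url = url.gsub('{page}') { page.to_s } unless page.nil?
  url
end

template = "http://www.lot-art.com/auction-search/?form_id=lot_search_form&page=1&mq=&q={query}&ord=recent"
puts build_search_url(template, "tables & chairs")
```

Printing the final URL and opening it in a browser is a quick way to check that the page being parsed is the one the selectors were tested against.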

Upvotes: 2

Views: 226

Answers (2)

Max Powah

Reputation: 21

In case someone else comes across this: I got the scraping of lot-art.com to work. It seems my CSS selector was not specific enough for Nokogiri to pull the correct data.
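
For anyone debugging a similar selector problem, here is a minimal sketch of nil-safe lookups. It relies on two Nokogiri behaviours: at_css returns a single node (or nil when nothing matches), and indexing a node with [] reads an attribute, so at_css(...)[0] yields nil rather than a first element. The helper names text_at and attr_at are hypothetical and not part of the app.

```ruby
# Hypothetical helpers: return nil instead of raising NoMethodError
# when a selector matches nothing on the given node.
def text_at(node, selector)
  node.at_css(selector)&.text&.strip
end

def attr_at(node, selector, name)
  found = node.at_css(selector)
  found && found[name]
end

# Possible use inside the scraping loop (item_data is a Nokogiri node there):
#   title   = text_at(item_data, "div.lot_list_body a")
#   picture = attr_at(item_data, "a.lot_list_thumb img", "src")
#   next if title.nil? # skip the item and inspect the selector instead of crashing
```

A nil result then points directly at the selector that failed to match.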

I am still having issues with lot-tissimo, although that appears to have a different cause, since other scrapers, such as Scrapinghub's Portia spiders, also have trouble with the site.

Upvotes: 0

spickermann

Reputation: 107142

new only initializes an instance; it does not save it. Do you actually call save somewhere?

You have two options:

Call save on the item:

item = action.items.new(
  # ...
)
item.save

Or use create instead of new:

item = action.items.create(
  # ...
)
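
Also note that save returns false rather than raising when a validation fails, and a scheduler that only counts successful saves never shows why a record was rejected. A minimal sketch of a helper that surfaces the reason follows, assuming the record is a standard ActiveRecord model; save_and_report is a hypothetical name.

```ruby
require 'logger'

# Hypothetical helper: saves a record and, if the save is rejected,
# logs the validation messages so the failure is visible.
def save_and_report(item, logger)
  return true if item.save

  logger.warn("Item not saved: #{item.errors.full_messages.join(', ')}")
  false
end

# Possible use in a Rails app:
#   number_of_new_items += 1 if save_and_report(item, Rails.logger)
```

Calling save! instead is an alternative when raising an exception on failure is preferable.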

Upvotes: 2
