Reputation: 79
I need to seed or scrape the data from another site in order to have content for my project.
How do you go about scraping data from another site using your own Rails app? Do you use a separate application/server to run some sort of cron job and then add that data to your Rails app? Or is it possible to have your own site scrape the data and display it directly?
My first idea was to scrape a site using Mechanize, then add the data to the fixtures in my Rails app as seed data. Is there a better way? Maybe even a way to continuously scrape the other site and display the data in my own Rails app?
Upvotes: 2
Views: 279
Reputation: 161
I use Heroku, and it comes with an add-on called Scheduler that works quite well for my little project. I believe it works much like cron.
Once the data is scraped, it goes directly into the database (PostgreSQL), and then you can display whatever you want with a database query.
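One thing to watch with a scheduled scrape is that every run sees mostly the same items again, so the save step should update existing rows rather than insert duplicates. Here is a minimal sketch of that idea. `ScrapedStore` and `store_scraped` are hypothetical names, and the in-memory hash is only a stand-in for a real table with a unique key (in a Rails app, something like `find_or_create_by` on a model would play this role).

```ruby
# In-memory stand-in for a database table with a unique key on :url.
# (Hypothetical; a real app would write to its database instead.)
class ScrapedStore
  def initialize
    @rows = {}
  end

  # Insert the record, or overwrite the row that already has the same url.
  def upsert(record)
    @rows[record.fetch(:url)] = record
  end

  # Look up a stored record by its url (nil if it was never stored).
  def find(url)
    @rows[url]
  end

  def count
    @rows.size
  end
end

# One scheduled run: save everything the scraper returned.
def store_scraped(store, records)
  records.each { |record| store.upsert(record) }
end
```

Running `store_scraped` twice with the same records leaves the row count unchanged, which is the behaviour you want from a job that fires on a timer.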
Upvotes: 0
Reputation: 1163
You can use rufus-scheduler together with the watir-dom-wait gem to solve this. I did a similar scraping task that fetches the Amazon KDP book list. Because Watir drives a real browser, it can also fetch data that is loaded by Ajax requests. Mechanize and Nokogiri do not execute JavaScript, so they will not see Ajax-loaded content.
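require 'rufus-scheduler'
require 'watir'
require 'watir-dom-wait'
require 'selenium-webdriver'

# Download the report from Amazon KDP.
def download_report
  # Chrome preferences; these download settings are illustrative assumptions.
  prefs = { download: { prompt_for_download: false, default_directory: Dir.pwd } }
  browser = Watir::Browser.new :chrome, options: { prefs: prefs }
  # Log in (replace the credentials with your own KDP login)
  browser.goto 'https://kdp.amazon.com/en_US/reports-new'
  browser.input(name: 'email').send_keys('[email protected]')
  browser.input(name: 'password').send_keys('password')
  browser.input(id: 'signInSubmit').click
  browser.span(text: 'Generate Report').click
end

scheduler = Rufus::Scheduler.new
# `every` repeats the job once a day (`in` would run it only once).
scheduler.every '1d' do
  download_report
end
# In a standalone script, uncomment the next line to keep the process alive.
# scheduler.join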
Upvotes: 2
Reputation: 1109
I use Nokogiri to scrape websites.
You don't need a separate application. You can have methods inside your models that handle the scraping and populate your database, and then create a rake file that runs those methods.
I name mine scheduler.rake
This goes in lib/tasks/
Then, if you're using Heroku, you can add the Scheduler add-on (it was available for free as of 28/12/2018).
Heroku has pretty good docs explaining how to configure things on its side.
Upvotes: 0