Reputation: 1566
I am trying to understand how to scrape pages.
The results are not looping on the view page. It only shows the first one. Why?
LinksController:
class LinksController < ApplicationController
  def craigslist_scrape
    require 'open-uri'
    url = "https://losangeles.craigslist.org/search/web"
    page = Nokogiri::HTML(open(url))
    @craigslist_info = page.css("ul.rows")
    @link_info = @craigslist_info.at_css("li.result-row p.result-info a.result-title.hdrlnk")
    @date = @craigslist_info.at_css("li.result-row p.result-info time.result-date")
  end
end
View page: craigslist_scrape.html.erb:
<% @craigslist_info.each do |craig| %>
  <p><%= "Title of the job: #{@link_info.text}" %></p>
  <p><%= "Date: #{@date.text}" %></p>
<% end %>
Routes:
Rails.application.routes.draw do
  root 'links#craigslist_scrape'
end
Schema:
ActiveRecord::Schema.define(version: 20170308223314) do
  enable_extension "plpgsql"
  create_table "links", force: :cascade do |t|
    t.string "link_info"
    t.string "date"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
  end
end
Upvotes: 0
Views: 138
Reputation: 5690
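You're iterating over @craigslist_info, but .css("ul.rows") will only pick up a single element (the list that wraps the results), so the loop body runs once. On top of that, .at_css returns only the first matching element, so @link_info and @date each hold a single node. That is why the view shows just the first listing.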
Try something like:
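# URI.open needs require 'open-uri'; on Ruby older than 2.5, use open(url) instead.
page = Nokogiri::HTML(URI.open(url))
# css returns every matching node, unlike at_css, which returns only the first.
@links = page.css("li.result-row p.result-info a.result-title.hdrlnk")
@dates = page.css("li.result-row p.result-info time.result-date")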
And then in your view:
<% @links.each_with_index do |link, index| %>
  <p><%= "Title of the job: #{link.text}" %></p>
  <p><%= "Date: #{@dates[index].text}" %></p>
<% end %>
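The index-based pairing above can also be written with zip. This is only a rough sketch in plain Ruby: the arrays of strings are made-up sample data standing in for the text of the scraped nodes. It also shows why two parallel lists are a bit fragile: if a listing had no date element, the lists would no longer line up.

```ruby
# Hypothetical sample data: plain strings stand in for link.text and date.text.
links = ["Web Developer", "Front-End Engineer", "Rails Contractor"]
dates = ["Mar 8", "Mar 7", "Mar 7"]

# zip pairs the elements by position, so no explicit index is needed.
rows = links.zip(dates).map do |title, date|
  "Title of the job: #{title} | Date: #{date}"
end

puts rows

# When the second list is shorter, zip pads the missing entries with nil
# instead of raising, which makes a length mismatch easy to detect.
short_dates = dates.first(2)
mismatched = links.zip(short_dates)
puts "Lists are out of sync" if mismatched.any? { |_, date| date.nil? }
```

In the view, the same idea would be @links.zip(@dates).each do |link, date|, with .text called on each node.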
If you want to tidy things up, you can also model the scraped data in an easier-to-understand form by selecting one node per listing and reading the link and date from within it. For example:
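require 'ostruct'

# One node per listing; each object holds that listing's link and date nodes.
results = page.css("li.result-row p.result-info")
@result_objects = results.map do |row|
  OpenStruct.new(
    link: row.at_css("a.result-title.hdrlnk"),
    date: row.at_css("time.result-date")
  )
end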
And then iterate over @result_objects, knowing that you can access .link and .date for each one.
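To give an idea of what that iteration could look like, here is a small sketch that renders a view-like template outside Rails with the standard erb library. Everything in it is an assumption for illustration: Result is a plain Struct swapped in for OpenStruct, and the sample rows hold plain strings rather than Nokogiri nodes, so the template reads result.link instead of result.link.text.

```ruby
require 'erb'

# Struct stands in for OpenStruct here; both expose .link and .date readers.
Result = Struct.new(:link, :date, keyword_init: true)

# Hypothetical sample data in place of the scraped nodes.
result_objects = [
  Result.new(link: "Web Developer", date: "Mar 8"),
  Result.new(link: "Rails Contractor", date: "Mar 7")
]

# trim_mode "<>" drops the newline after lines that start with <% and end with %>.
template = ERB.new(<<~VIEW, trim_mode: "<>")
  <% result_objects.each do |result| %>
  <p>Title of the job: <%= result.link %></p>
  <p>Date: <%= result.date %></p>
  <% end %>
VIEW

# result_with_hash exposes the hash keys as local variables in the template.
html = template.result_with_hash(result_objects: result_objects)
puts html
```

In the actual Rails view you would loop over @result_objects the same way and call .text on each link and date.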
Upvotes: 0
Reputation: 265
It's probably because you are only scraping the first page of results. If you go to the URL you are scraping, "https://losangeles.craigslist.org/search/web", you can see that it only shows the first 100 results. If you scroll down and click "next", the URL changes to "https://losangeles.craigslist.org/search/web?s=100". If you want to scrape ALL results, you need a method that scrapes each page of results in turn.
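As a rough sketch of that idea, the page URLs could be generated up front and then fetched one by one. The page_urls helper, the PAGE_SIZE constant, and the fixed page count are all hypothetical, and the assumption that the s offset keeps growing by 100 on later pages is extrapolated from the "?s=100" link above.

```ruby
require 'uri'

BASE_URL = "https://losangeles.craigslist.org/search/web"
PAGE_SIZE = 100 # assumed number of results per page

# Builds one URL per results page. The first page has no query string;
# later pages carry the offset in the "s" parameter.
def page_urls(base_url, page_count, page_size = PAGE_SIZE)
  (0...page_count).map do |page_index|
    uri = URI.parse(base_url)
    offset = page_index * page_size
    uri.query = URI.encode_www_form(s: offset) if offset.positive?
    uri.to_s
  end
end

urls = page_urls(BASE_URL, 3)
puts urls
```

Each URL can then go through the same Nokogiri parsing as the first page, with the results collected into a single array. In a real scraper you would stop when a page returns no rows rather than hard-coding the page count.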
Upvotes: 1
Reputation: 1975
Within your iteration over @craigslist_info, you never reference the block variable, craig; you only reference @link_info and @date, which each hold a single node, so only one result is produced. Inside the loop, look up the link and date on craig instead. Because @craigslist_info holds the wrapping ul.rows element rather than the individual rows, iterate over its li.result-row children:
<% @craigslist_info.css("li.result-row").each do |craig| %>
  <% link_info = craig.at_css("p.result-info a.result-title.hdrlnk") %>
  <% date = craig.at_css("p.result-info time.result-date") %>
  <p><%= "Title of the job: #{link_info.text}" %></p>
  <p><%= "Date: #{date.text}" %></p>
<% end %>
Upvotes: 0