user273072545345

Reputation: 1566

Nokogiri results not looping

I am trying to understand how to scrape pages.

The results are not looping in the view. It only shows the first one. Why?

LinksController:

class LinksController < ApplicationController

    def craigslist_scrape
        require 'open-uri'

        url = "https://losangeles.craigslist.org/search/web"

        page = Nokogiri::HTML(open(url))

        @craigslist_info = page.css("ul.rows")

        @link_info = @craigslist_info.at_css("li.result-row p.result-info a.result-title.hdrlnk")
        @date = @craigslist_info.at_css("li.result-row p.result-info time.result-date")
    end

end

View page: craigslist_scrape.html.erb:

<% @craigslist_info.each do |craig| %>
    <p><%= "Title of the job: #{@link_info.text}" %></p>
    <p><%= "Date: #{@date.text}" %></p>
<% end %>

[Screenshot: the page shows only the first result]

Routes:

Rails.application.routes.draw do
    root 'links#craigslist_scrape'
end

Schema:

ActiveRecord::Schema.define(version: 20170308223314) do
  enable_extension "plpgsql"

  create_table "links", force: :cascade do |t|
    t.string   "link_info"
    t.string   "date"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
  end

end

Upvotes: 0

Views: 138

Answers (3)

gwcodes

Reputation: 5690

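You're iterating over @craigslist_info, but .css("ul.rows") will only pick up a single element (the list itself), so the loop body runs once. In addition, .at_css returns only the first matching element, so @link_info and @date each hold a single node rather than one per row.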

Try something like:

page = Nokogiri::HTML(open(url))
@links = page.css("li.result-row p.result-info a.result-title.hdrlnk")
@dates = page.css("li.result-row p.result-info time.result-date")

And then in your view:

<% @links.each_with_index do |link, index| %>
  <p><%= "Title of the job: #{link.text}" %></p>
  <p><%= "Date: #{@dates[index].text}" %></p>
<% end %>

If you want to tidy things up, you can also model the scraped data in an easier-to-understand form. For example:

require 'ostruct' # OpenStruct lives in the standard library

# Build one object per result row so each link stays paired with its date
results = page.css("li.result-row p.result-info")
@result_objects = results.map do |o|
  OpenStruct.new(
    link: o.at_css("a.result-title.hdrlnk"),
    date: o.at_css("time.result-date")
  )
end

And then iterate over @result_objects, knowing that you can access .link and .date for each one.
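A minimal sketch of that iteration, rendered with the standard library's ERB so it runs outside Rails. The fake_node struct and the sample titles and dates are hypothetical stand-ins for the Nokogiri nodes; in the Rails view you would loop over @result_objects instead of a local variable.

```ruby
require 'erb'
require 'ostruct'

# Stand-in for a Nokogiri node: it only needs the #text method the view calls.
fake_node = Struct.new(:text)

# Hypothetical sample data in the same shape as @result_objects.
result_objects = [
  OpenStruct.new(link: fake_node.new("Rails developer"), date: fake_node.new("Mar 8")),
  OpenStruct.new(link: fake_node.new("Web designer"),    date: fake_node.new("Mar 7"))
]

# Same markup as the view, but reading .link and .date from each object.
template = ERB.new(<<~VIEW)
  <% result_objects.each do |result| %>
  <p>Title of the job: <%= result.link.text %></p>
  <p>Date: <%= result.date.text %></p>
  <% end %>
VIEW

rendered = template.result(binding)
puts rendered
```

Because each object carries its own link and date, the view no longer needs an index to keep the two lists aligned.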

Upvotes: 0

C.Kelly

Reputation: 265

It's probably because you are only scraping the first page of results. If you open the URL you are scraping, "https://losangeles.craigslist.org/search/web", you can see that it shows only the first 100 results. If you scroll down and click "next", the URL changes to "https://losangeles.craigslist.org/search/web?s=100". If you want to scrape ALL results, you need a method that scrapes each page of results in turn.
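As a sketch of how the per-page URLs could be generated, assuming the s query parameter is a result offset that grows by 100 per page (inferred from the "next" link above, not from any documented API). The page_urls helper and its pages and per_page parameters are hypothetical names for illustration.

```ruby
require 'uri'

# Build one URL per results page. The offset step of 100 is an assumption
# based on the "next" link; the site may change it.
def page_urls(base_url, pages:, per_page: 100)
  (0...pages).map do |page_index|
    uri = URI.parse(base_url)
    offset = page_index * per_page
    # The first page carries no offset parameter.
    uri.query = URI.encode_www_form(s: offset) unless offset.zero?
    uri.to_s
  end
end

urls = page_urls("https://losangeles.craigslist.org/search/web", pages: 3)
puts urls
```

Each URL can then be fetched and parsed the same way as the first page. How many pages exist is not known in advance, so a real scraper would more likely stop when a page returns no result rows than use a fixed count.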

Upvotes: 1

jamesvphan

Reputation: 1975

Within your iteration of @craigslist_info, you are not referencing the block variable, craig, and instead referencing only @link_info and @date, which each hold a single node. This will only produce one result. Note also that @craigslist_info is the ul.rows list itself, so iterate over its li.result-row children and, within the loop, look up the link_info and date of each "craig".

<% @craigslist_info.css("li.result-row").each do |craig| %>
    <% link_info = craig.at_css("p.result-info a.result-title.hdrlnk") %>
    <% date = craig.at_css("p.result-info time.result-date") %>
    <p><%= "Title of the job: #{link_info.text}" %></p>
    <p><%= "Date: #{date.text}" %></p>
<% end %>

Upvotes: 0
