Michal

Reputation: 89

Nokogiri: scrape all child divs from a selected div

I was playing around with Nokogiri in my free time, and I'm afraid I got really stuck. I have been trying to solve this problem since this morning (almost 8 hours now :( ) and it looks like I haven't made any progress at all. I want to scrape all the threads on the page. So far I have figured out that the parent of all the threads is

<div id="threads" class="extended-small">

Each thread consists of 3 elements:

  1. link to the image
  2. div#title that contains value of replies(R) and images(I)
  3. div#teaser that contains the name of the thread

My question is: how can I select the children of id='threads' and push each child, with its 3 elements, to the array? As you can see from this code, I don't really know what I am doing, and I would very, very much appreciate any help.

require 'httparty'
require 'nokogiri'
require 'json'
require 'pry'
require 'csv'

page = HTTParty.get('https://boards.4chan.org/g/catalog')

parse_page = Nokogiri::HTML(page)

threads_array = []

threads = parse_page.search('.//*[@id="threads"]/div') do |a|
    post_id = a.text
    post_pic = a.text
    post_title = a.text
    post_teaser = a.text
    threads_array.push(post_id, post_pic, post_title, post_teaser)
end

CSV.open('sample.csv','w') do |csv|
    csv << threads_array
end

Pry.start(binding)


Upvotes: 0

Views: 429

Answers (1)

Philip Hallstrom

Reputation: 19879

The raw HTML source doesn't appear to contain those fields, which is why you don't see them when fetching with HTTParty and parsing with Nokogiri. It looks like the data is stored in a JS variable farther up the page. Try this:

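require 'httparty'
require 'json'
require 'pp'

page = HTTParty.get('https://boards.4chan.org/g/catalog')

# The thread data sits in the page source as a JS object literal:
#   var catalog = {...};var ...
# The /m flag lets the match continue across line breaks.
m = page.body.match(/var catalog = ({.*?});var/m)
abort 'catalog variable not found in page source' if m.nil?

json_str = m.captures.first
catalog = JSON.parse(json_str)
pp catalog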

Whether that is robust enough I'll let you decide :)
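
To get from the parsed catalog to one CSV row per thread, something along these lines might work. This is only a sketch: the key names ("threads", "imgurl", "r", "i", "teaser") and the sample_html string are assumptions made up for illustration, so check the output of pp catalog for the real structure first. The helper names extract_catalog and thread_rows are hypothetical too.

```ruby
require 'json'
require 'csv'

# Pulls the object literal assigned to `var catalog` out of the page source.
# Returns the parsed Hash, or nil when the variable is not present.
def extract_catalog(html)
  m = html.match(/var catalog = (\{.*?\});var/m)
  m && JSON.parse(m[1])
end

# Builds one array per thread. The key names used here are ASSUMED,
# not taken from the real page; adjust them to what `pp catalog` shows.
def thread_rows(catalog)
  catalog.fetch('threads', {}).map do |id, t|
    [id, t['imgurl'], t['r'], t['i'], t['teaser']]
  end
end

# Made-up stand-in for page.body so the sketch runs without a network call.
sample_html = <<~HTML
  <script>var catalog = {"threads":{"101":{"imgurl":"abc","r":12,"i":3,"teaser":"First thread"},"102":{"imgurl":"def","r":0,"i":1,"teaser":"Second thread"}}};var style_group = "x";</script>
  <div id="threads" class="extended-small"></div>
HTML

catalog = extract_catalog(sample_html)
rows = thread_rows(catalog)

# Append each thread as its own row instead of pushing everything
# into a single flat array.
csv_text = CSV.generate do |csv|
  csv << %w[id image replies images teaser]
  rows.each { |row| csv << row }
end
puts csv_text
```

With the real page you would pass page.body to extract_catalog, and swap CSV.generate for CSV.open('sample.csv', 'w') to write a file. Note that the lazy regex stops at the first "};var" it finds, so a teaser containing that exact text would break the match.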

Upvotes: 3
