Reputation: 73
I am trying to scrape questions from Quora for a given topic, using this project as a foundation: https://github.com/Theminijohn/quora-scraper. As shown on that page, the follower count is extracted correctly for each question. However, when I run the same code on my system, the follower count for every question is zero, even when the question has followers: the Followers column in the output CSV is always 0.
The code which is responsible for extracting the number of followers is this:
follower_count = q.css('.FollowActionItem .icon_action_bar-label span > span:last-child').text.to_i
Everything else is working as expected. What am I missing here?
Edit: The whole code snippet is as follows:
require 'rubygems'
require 'ruby-progressbar'
require 'nokogiri' # lowercase: 'Nokogiri' fails to load on case-sensitive file systems
require 'csv'
require 'date'
require 'pry'

ENGAGEMENT_THRESHOLD = 5

# init progressbar
progressbar = ProgressBar.create(format: '%a %bᗧ%i %p%% %t',
                                 progress_mark: ' ',
                                 remainder_mark: '・')

# parse file
doc = File.open("input.html") { |x| Nokogiri::HTML(x) }
questions = doc.css('.TopicAllQuestionsList .pagedlist_item')

# identifiers
canonical_link = doc.at('link[rel="canonical"]')['href']
topic_name = canonical_link.match(/quora.com\/topic\/(.*)/)[1]

# update progressbar
progressbar.total = questions.count

# prepare csv
unless File.exist?('quora-data.csv')
  CSV.open("quora-data.csv", "w+") do |csv|
    csv << [
      "Topic", "Title", "Followers", "Answers", "Ratio", "Engagement potential",
      "Last action", "Parsed time", "Question link"
    ]
  end
end

questions.each do |q|
  link = "https://www.quora.com" + q.css('a.question_link').attr('href').value
  title = q.css('a.question_link').text.strip
  answer_count = q.css('.QuestionFooter .answer_count_prominent').text.strip.to_i
  follower_count = q.css('.FollowActionItem .icon_action_bar-label span > span:last-child').text.to_i
  ratio = "#{follower_count}/#{answer_count}"

  if answer_count == 0
    take_action = (follower_count >= ENGAGEMENT_THRESHOLD) ? "Yes" : "No"
  else
    take_action = ((follower_count / answer_count) >= ENGAGEMENT_THRESHOLD) ? "Yes" : "No"
  end

  # timestamps
  raw_time = q.css('.QuestionFooter .question_timestamp').text.strip
  last_action = raw_time.include?("Last requested") ? "Requested" : "Followed"

  if raw_time.include?('ago')
    if raw_time.scan(/(\d*)h/).flatten.any?
      hours_ago = raw_time.scan(/(\d*)h/).flatten[0].to_f
      parsed_time = (DateTime.now - (hours_ago / 24)).strftime('%Y-%m-%d')
    elsif raw_time.scan(/(\d*)m/).flatten.any?
      minutes_ago = raw_time.scan(/(\d*)m/).flatten[0].to_f
      # subtract the parsed minutes (previously a fixed one minute was subtracted)
      parsed_time = (DateTime.now - (minutes_ago / 24 / 60)).strftime('%Y-%m-%d')
    end
  else
    if raw_time.count("0-9") > 0
      parsed_time = Date.parse(raw_time).strftime("%Y-%m-%d")
    else
      parsed_time =
        (Date.today < Date.parse(raw_time)) ? (Date.parse(raw_time) - 7) : Date.parse(raw_time)
    end
  end

  CSV.open("quora-data.csv", "a+") do |csv|
    csv << [
      topic_name, title, follower_count, answer_count, ratio,
      take_action, last_action, parsed_time, link
    ]
  end

  # move progressbar
  progressbar.increment
end
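Not related to the follower problem, but to make the relative-timestamp branch of the script easier to check on its own, here is a minimal sketch of it as a standalone method. The name `relative_to_date` and the injectable `now` argument are my own assumptions for illustration, not part of the original scraper, and the sample string format ("Last requested 2h ago") is inferred from the checks in the script.

```ruby
require 'date'

# Hypothetical helper (not in the original scraper): converts a relative
# timestamp such as "Last requested 2h ago" or "... 30m ago" into a
# 'YYYY-MM-DD' string. Returns nil when no hour or minute offset is found.
def relative_to_date(raw_time, now = DateTime.now)
  if (m = raw_time.match(/(\d+)\s*h/))
    offset_days = m[1].to_f / 24        # hours expressed as a fraction of a day
  elsif (m = raw_time.match(/(\d+)\s*m/))
    offset_days = m[1].to_f / 24 / 60   # minutes expressed as a fraction of a day
  else
    return nil
  end
  (now - offset_days).strftime('%Y-%m-%d')
end

puts relative_to_date("Last requested 2h ago")
```

Using `\d+` instead of `\d*` means at least one digit has to precede the unit, so a bare "h" or "m" in the text is not treated as an offset.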
<!DOCTYPE html>
<!-- saved from url=(0099)file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora.html -->
<html lang="en" class="js-wf-loaded"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><link rel="icon" href="https://qsf.fs.quoracdn.net/-3-images.favicon.ico-26-ae77b637b1e7ed2c.ico"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q-icons.q-icons.woff2-26-9afc20a49e3ef2cf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_regular.woff2-26-7ace3bc4cbe404d9.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_regular_italic.woff2-26-9d81ab3229809d01.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_semibold.woff2-26-b55bf39d9018ace9.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_semibold_italic.woff2-26-4c39f22524232bf2.woff2"><script src="./input_files/sdk.js.download" async="" crossorigin="anonymous"></script><script src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/sdk.js.download" async="" crossorigin="anonymous"></script><script async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/analytics.js.download"></script><script type="text/javascript" async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/widgets.js.download"></script><script type="text/javascript" async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/sdk.js(1).download"></script><script type="text/javascript">window.Q = {"fontFamilies": ["q-icons", "q_serif"], "errorSamplingRate": 1.0, "revision": 
"41e9b4435b78728ddf351e72a6dc45ca9708ebc2", "subdomainSuffix": "quora.com"};window["webpackManifest"] = {"ads_manager": "https://qsc.fs.quoracdn.net/-3-chunk.web.ads_manager.js.out-34-1e09a2ca57288a3c.webpack", "content_widgets": "https://qsc.fs.quoracdn.net/-3-chunk.web.content_widgets.js.out-34-9a6c124eee999cb7.webpack", "dev": "https://qsc.fs.quoracdn.net/-3-chunk.web.dev.js.out-34-5d22ece0a38f03a1.webpack", "internal": "https://qsc.fs.quoracdn.net/-3-chunk.web.internal.js.out-34-2e41b1b9af1f0f88.webpack", "qtext2": "https://qsc.fs.quoracdn.net/-3-chunk.web.qtext2.js.out-34-b3d77df0693a06da.webpack", "main": "https://qsc.fs.quoracdn.net/-3-chunk.web.main.js.out-34-835b38fb05330b9f.webpack", "firebase": "https://qsc.fs.quoracdn.net/-3-chunk.web.firebase.js.out-34-eadc5f3144befc37.webpack", "publisher_dashboard": "https://qsc.fs.quoracdn.net/-3-chunk.web.publisher_dashboard.js.out-34-0c43bcc87e209b23.webpack"};window["webpackChunks"] = ["main"];window["PAGE_IS_MOBILE"] = false;var assetErrs=[];document.addEventListener("DOMContentLoaded",function(e){if(0!==assetErrs.length){var s="assets="+encodeURIComponent(JSON.stringify(assetErrs)),t=new XMLHttpRequest;t.open("POST","/ajax/log_browser_asset_load_error_3RD_PARTY_POST",!0),t.setRequestHeader("Content-Type","application/x-www-form-urlencoded; charset=UTF-8"),t.setRequestHeader("Accept","*/*"),t.send(s.replace(/%20/g,"+"))}}),window.addAssetErr=function(e){e&&assetErrs.push(e)};
The complete HTML file can be found here: https://drive.google.com/file/d/1_X86tq5TTw4ikk-hQ2Ixd13Y_hR4scBg/view?usp=sharing
The HTML that contains the follower count is:
<div class="FollowActionItem ItemComponent primary_item u-relative"><span id="wVP1Ux4a11"><a class="ui_button ui_button--styled ui_button--FlatStyle ui_button--FlatStyle--gray ui_button--size_regular u-inline-block ui_button--non_link ui_button--supports_icon ui_button--has_icon" href="#" role="button" action_click="QuestionFollow" action_target="{&quot;qid&quot;: 44394942, &quot;type&quot;: &quot;question&quot;}" id="__w2_wVP1Ux4a27_button"><div class="ui_button_inner" id="__w2_wVP1Ux4a27_inner"><div class="ui_button_icon_wrapper u-relative u-flex-inline"><div id="__w2_wVP1Ux4a27_icon"><span class="ui_button_icon" aria-hidden="true"><svg width="24px" height="24px" viewBox="0 0 24 24" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g stroke="none" fill="none" fill-rule="evenodd" stroke-linecap="round">
<g id="follow" class="icon_svg-stroke" stroke="#666" stroke-width="1.5">
<path d="M14.5,19 C14.5,13.3369229 11.1630771,10 5.5,10 M19.5,19 C19.5,10.1907689 14.3092311,5 5.5,5" id="lines"></path>
<circle id="circle" cx="7.5" cy="17" r="2" class="icon_svg-fill" fill="none"></circle>
</g>
</g>
</svg></span></div></div><div class="ui_button_label_count_wrapper"><span class="ui_button_label" id="__w2_wVP1Ux4a27_label">Follow</span><span class="ui_button_count" aria-hidden="true" id="__w2_wVP1Ux4a27_count_wrapper"><span class="bullet"> · </span><span class="ui_button_count_inner" id="__w2_wVP1Ux4a27_count">1</span></span></div></div></a></span></div>
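One observation about this fragment: the count (`1`) sits in `span.ui_button_count_inner`, and no element in it carries an `icon_action_bar-label` class. When a Nokogiri `css` call matches nothing it returns an empty NodeSet, whose `text` is an empty string, and `"".to_i` is 0, so a constant zero may mean "selector matched nothing" rather than "no followers". A minimal sketch of a parser that keeps those two cases apart follows; `parse_count` is a hypothetical helper name, not part of the scraper or of Nokogiri.

```ruby
# Hypothetical helper: returns the first integer found in the text,
# or nil when the text has no digits at all (for example, the empty
# string produced by a selector that matched nothing).
def parse_count(text)
  digits = text.to_s[/\d[\d,]*/]        # first run of digits, allowing commas
  digits && digits.delete(',').to_i
end

# Plain String#to_i cannot tell "nothing matched" apart from a real zero:
p "".to_i
p parse_count("")
p parse_count("1")
```

In the script this would be called as `parse_count(q.css(selector).text)`, and a `nil` result for every question would point at the selector rather than at the data.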
Upvotes: 2
Views: 308