Reputation: 81
I am learning BeautifulSoup and trying to scrape links of different questions that are present on this Quora page.
As I scroll down the page, more questions keep loading and being displayed.
When I try to scrape the links to these questions with the code below, I only get (in my case) 5 links, i.e. I only get the links of 5 questions even though there are a lot more questions on the page.
Is there any workaround to get the links of all the questions present on the page?
from bs4 import BeautifulSoup
import requests

root = 'https://www.quora.com/topic/Graduate-Record-Examination-GRE-1'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.'}

r = requests.get(root, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

q = soup.find('div', {'class': 'paged_list_wrapper'})
no = 0
for i in q.find_all('div', {'class': 'story_title_container'}):
    t = i.a['href']
    no = no + 1
    print(root + t, '\n\n')
Upvotes: 1
Views: 752
Reputation: 61
The title is grabbed from the page and printed after trimming. This is one way to do it; I'm sure there are many others, and it only handles a single question.
import requests
from bs4 import BeautifulSoup
URL = "https://www.quora.com/Which-Deep-Learning-online-course-is-better-Coursera-specialization-VS-Udacity-Nanodegree-vs-FAST-ai"
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')
# grabs the text in the title
question = soup.select_one('title').text
# slice off the trailing ' - Quora' (the last 8 characters)
x = slice(-8)
print(question[x])
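A side note on the slicing: slice(-8) works because ' - Quora' is exactly 8 characters, but it will silently cut the wrong text if that suffix ever changes. A slightly more explicit sketch (assuming the title really does end with ' - Quora') strips the suffix by name:
suffix = ' - Quora'
if question.endswith(suffix):
    # only trim when the suffix is actually present
    question = question[:-len(suffix)]
print(question)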
Upvotes: 0
Reputation: 368
What you are trying to accomplish cannot be done with requests and BeautifulSoup alone, because the additional questions are loaded dynamically as you scroll. You need to use Selenium.
Here is an answer using Selenium and ChromeDriver. Download the ChromeDriver that matches your Chrome version and install Selenium with pip install -U selenium.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import csv

browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
browser.get("https://www.quora.com/topic/Graduate-Record-Examination-GRE-1")
time.sleep(1)

elem = browser.find_element_by_tag_name("body")

# press Page Down a few times so Quora loads more questions
no_of_pagedowns = 5
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)
    no_of_pagedowns -= 1

# collect the link of every question that is now on the page
post_elems = browser.find_elements_by_xpath("//a[@class='question_link']")
for post in post_elems:
    print(post.get_attribute("href"))
If you are using Windows, point to the .exe instead: executable_path='/path/to/chromedriver.exe'.
Change the variable no_of_pagedowns = 5 to specify how many times you want to scroll down (or see the sketch at the end of this answer for scrolling until nothing new loads).
Running this, I got the question links as output.
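If you would rather not guess a fixed number of page-downs, here is a rough sketch that keeps scrolling until the page height stops growing and then saves the links with the csv module. It reuses browser, time and csv from the snippet above, and questions.csv is just an example file name:
# Rough sketch: scroll until the page height stops growing,
# i.e. Quora stops loading new questions.
last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the next batch of questions time to load
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new was loaded, stop scrolling
    last_height = new_height

post_elems = browser.find_elements_by_xpath("//a[@class='question_link']")
with open('questions.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for post in post_elems:
        writer.writerow([post.get_attribute("href")])
On a busy topic this can keep loading for a long time, so you may still want to cap the number of iterations.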
Upvotes: 1