Reputation: 421
I am trying to scrape reviews from a webpage. The attached image shows that the reviews are in a <p> tag under a div with the class "more reviewdata". I tried BeautifulSoup first and then Selenium to extract the "more reviewdata" portion, but both failed, although other <p> and <div> tags are extracted without problems. One of the tutorial websites I visited hinted that a dynamic page will not show all of its source when you click Inspect. Here the review content does show up after clicking Inspect, so I assumed the page is not dynamic. Can anybody suggest what is going wrong? Thanks in advance. For BeautifulSoup, my code is like this:
import requests
url = 'https://www.mouthshut.com/hindi-movies/Tanhaji-reviews-925997893'
response = requests.get(url)
page_contents = response.text
from bs4 import BeautifulSoup
doc = BeautifulSoup(page_contents, 'html.parser')
For Selenium and Chrome Driver I wrote:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", options=options)
import time
driver.get("https://www.mouthshut.com/hindi-movies/Tanhaji-reviews-925997893")
more_review_data_class = driver.find_elements_by_class_name("more reviewdata")
page_contents = driver.page_source
Upvotes: 1
Views: 253
Reputation: 33371
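find_elements_by_class_name accepts a single class name, and "more reviewdata" is two classes separated by a space. To match multiple class names you should use a CSS selector or XPath.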
So instead of
more_review_data_class = driver.find_elements_by_class_name("more reviewdata")
Try this:
more_review_data = driver.find_elements_by_css_selector(".more.reviewdata p")
or this
more_review_data = driver.find_elements_by_xpath("//div[@class='more reviewdata']//p")
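To illustrate why the space matters: the value of a class attribute is a space-separated list, so "more reviewdata" means the two classes more and reviewdata. Below is a minimal sketch of how such a value maps to a CSS selector or an XPath. The helper names css_for_classes and xpath_for_classes are hypothetical and not part of Selenium. The XPath variant tests each class as a whole token, so unlike @class='more reviewdata' it should still match if the element carries an extra class or lists the classes in a different order.

```python
def css_for_classes(class_attr: str, tag: str = "") -> str:
    """Build a CSS selector that requires every class listed in class_attr."""
    return tag + "".join("." + name for name in class_attr.split())


def xpath_for_classes(class_attr: str, tag: str = "*") -> str:
    """Build an XPath that tests each class as a whole token, in any order."""
    tests = " and ".join(
        f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"
        for name in class_attr.split()
    )
    return f"//{tag}[{tests}]"


# Selectors for the <p> tags inside the review div from the question.
print(css_for_classes("more reviewdata", tag="div") + " p")
print(xpath_for_classes("more reviewdata", tag="div") + "//p")
```

The printed strings can be passed to the CSS selector and XPath lookups shown above.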
You should also add a wait so that the page has loaded before you access the elements. It would look something like this:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.more.reviewdata p")))
time.sleep(0.5)
more_review_data = driver.find_elements_by_css_selector(".more.reviewdata p")
or
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.XPATH, "//div[@class='more reviewdata']//p")))
time.sleep(0.5)
more_review_data = driver.find_elements_by_xpath("//div[@class='more reviewdata']//p")
To simply print the texts inside the element you can iterate on the elements list and print each element text like this:
for element in more_review_data:
    print(element.text)
or
for element in more_review_data:
    print(element.get_attribute("innerHTML"))
Upvotes: 3
Reputation: 3400
The review content is loaded dynamically as the page loads. If you open the browser's Developer Tools, go to the Network tab, and look for requests related to reviews, you will find a link that returns all the reviews for the page.
Code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.mouthshut.com/Review/rar_reviews.aspx?cname=Tanhaji&cid=925997893&movie=1")
soup = BeautifulSoup(res.text, "lxml")
Here I have used a CSS class selector, which returns a list of matching elements:
main_data=soup.select("div.more.reviewdata")
for i in main_data:
    print(i.find("p").get_text())
Here's the output of the above script:
The movie is on real fact there was war for Kondhana ghad Tanhaji Malusare had attack on mughul on 4th - Feb 1670 and the brave fighter Tanhaji's one hand was cutted by Udaybhan but they still fighting and The Maratha's win the war I love the film and the unity of sawarj also great described in the film
. ....
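The selector can also be checked offline before pointing it at the live site. This is a small sketch, assuming HTML shaped like the review markup described in the question. The SAMPLE_HTML string is made up for illustration and is not the real response, and extract_reviews is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described in the question.
SAMPLE_HTML = """
<div class="more reviewdata"><p>First review text.</p></div>
<div class="other"><p>Not a review.</p></div>
<div class="reviewdata more extra"><p>Second review text.</p></div>
"""


def extract_reviews(html):
    """Return the text of the first <p> in every div carrying both classes."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for block in soup.select("div.more.reviewdata"):
        paragraph = block.find("p")
        if paragraph is not None:  # skip blocks that have no <p>
            reviews.append(paragraph.get_text(strip=True))
    return reviews


print(extract_reviews(SAMPLE_HTML))
```

Guarding against a missing <p> avoids the AttributeError that find("p").get_text() raises when a block has no paragraph.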
Upvotes: 2
Reputation: 890
Did you try this?
driver.find_elements_by_xpath("//div[@class='class name']")
In your case:
driver.find_elements_by_xpath("//div[@class='more reviewdata']")
Upvotes: 0