Reputation: 421

BeautifulSoup and Selenium cannot fetch <p> content under nested <div>

I am trying to scrape reviews from a webpage. The attached image shows that the reviews are in <p> tags under a div with the class "more reviewdata". I tried BeautifulSoup first and then Selenium to extract the "more reviewdata" portion, but both failed, although other <p> and <div> tags are extracted without problems. One of the tutorial websites I visited hinted that a dynamic page will not show all of its source when you click Inspect. But here the review content does show up after clicking Inspect, which I take to mean that this page is not dynamic. Can anybody suggest what is going wrong? Thanks in advance. For BeautifulSoup, my code is like this:

import requests
from bs4 import BeautifulSoup
url = 'https://www.mouthshut.com/hindi-movies/Tanhaji-reviews-925997893'
response = requests.get(url)
page_contents = response.text
doc = BeautifulSoup(page_contents, 'html.parser')

For Selenium with ChromeDriver I wrote:

import time
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", options=options)
driver.get("https://www.mouthshut.com/hindi-movies/Tanhaji-reviews-925997893")
more_review_data_class = driver.find_elements_by_class_name("more reviewdata")
page_contents = driver.page_source

Upvotes: 1

Answers (3)

Prophet

Reputation: 33371

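find_elements_by_class_name accepts only a single class name. When an element has multiple class names, as in class="more reviewdata", you should use a CSS selector or XPath instead.
So instead of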

more_review_data_class = driver.find_elements_by_class_name("more reviewdata")

Try this:

more_review_data = driver.find_elements_by_css_selector(".more.reviewdata p")

or this

more_review_data = driver.find_elements_by_xpath("//div[@class='more reviewdata']//p")
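
The rule behind the CSS selector above can be sketched as a small helper: each whitespace-separated name in the class attribute becomes one ".name" in the selector. The function compound_class_selector and its parameters are hypothetical, made up for illustration; they are not part of Selenium or BeautifulSoup.

```python
def compound_class_selector(class_attr, tag="", descendant=""):
    """Build a CSS selector matching elements that carry every class in class_attr.

    Hypothetical helper for illustration; not part of Selenium or BeautifulSoup.
    """
    # A class attribute such as "more reviewdata" holds two separate class names.
    tokens = class_attr.split()
    if not tokens:
        raise ValueError("class_attr must contain at least one class name")
    # Chain the names with dots: "more reviewdata" -> ".more.reviewdata"
    selector = tag + "".join("." + token for token in tokens)
    # Optionally target a descendant tag, e.g. the <p> inside the div.
    if descendant:
        selector += " " + descendant
    return selector


print(compound_class_selector("more reviewdata", tag="div", descendant="p"))
# prints: div.more.reviewdata p
```

Unlike the XPath test @class='more reviewdata', which compares the whole attribute string, the CSS form matches regardless of class order or extra classes. Note also that the find_elements_by_* methods were deprecated in Selenium 4 and later removed; there the equivalent call is driver.find_elements(By.CSS_SELECTOR, selector), with By imported from selenium.webdriver.common.by.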

You should also add a wait so that the page has loaded before you access the elements. With a CSS selector it would be something like this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.more.reviewdata p")))
time.sleep(0.5)
more_review_data = driver.find_elements_by_css_selector(".more.reviewdata p")

Or, with XPath (using the same imports):

wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.XPATH, "//div[@class='more reviewdata']//p")))
time.sleep(0.5)
more_review_data = driver.find_elements_by_xpath("//div[@class='more reviewdata']//p")

To print the text inside each element, iterate over the list of elements and print each element's text:

for element in more_review_data:
    print(element.text)

Or, to print each element's inner HTML instead:

for element in more_review_data:
    print(element.get_attribute("innerHTML"))

Upvotes: 3

Bhavya Parikh

Reputation: 3400

The review contents on this site are loaded dynamically. If you open the browser's Developer Tools, go to the Network tab, and look for requests related to reviews, you will find a link that returns all the reviews for the page.

Code:

import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.mouthshut.com/Review/rar_reviews.aspx?cname=Tanhaji&cid=925997893&movie=1")
# The "lxml" parser requires the lxml package to be installed
soup = BeautifulSoup(res.text, "lxml")

Here I have used a CSS class selector, which returns a list of matching elements:

main_data = soup.select("div.more.reviewdata")
for i in main_data:
    print(i.find("p").get_text())

Here's the output of the above script:

   The movie is on real fact there was war for Kondhana ghad Tanhaji Malusare had attack on mughul on  4th - Feb 1670 and the brave fighter Tanhaji's one hand was cutted by Udaybhan but they still fighting and The Maratha's win the war I love the film and the unity of sawarj also great described in the film 
. ....
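
A quick way to tell whether a page needs this approach is to check if the markup is present in the raw response that requests receives, since Inspect shows the DOM after JavaScript has run rather than the original source. The sketch below uses only the standard library; ClassFinder and count_elements_with_classes are hypothetical names, and the two HTML strings are made-up stand-ins for response.text, not real output from the site.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags whose class attribute contains all wanted class names."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None.
        class_value = dict(attrs).get("class") or ""
        # Match only if every wanted class name is present on this tag.
        if self.wanted <= set(class_value.split()):
            self.matches += 1


def count_elements_with_classes(html, wanted_classes):
    """Return how many tags in html carry all of wanted_classes."""
    finder = ClassFinder(wanted_classes)
    finder.feed(html)
    finder.close()
    return finder.matches


# Made-up HTML: what a raw response might look like vs. the rendered DOM.
static_html = '<div class="review"><p>summary</p></div>'
rendered_html = '<div class="more reviewdata"><p>full review</p></div>'

print(count_elements_with_classes(static_html, ["more", "reviewdata"]))    # prints: 0
print(count_elements_with_classes(rendered_html, ["more", "reviewdata"]))  # prints: 1
```

If the count for response.text from the original URL is 0 while Inspect shows the elements, the content was most likely added after the initial page load, and looking for the underlying request in the Network tab is a reasonable next step.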

Upvotes: 2

Shabari nath k

Reputation: 890

Did you try this?

driver.find_elements_by_xpath("//div[@class='class name']")

In your case:

driver.find_elements_by_xpath("//div[@class='more reviewdata']")

Upvotes: 0
