Monisha

Reputation: 65

Web scraping using Selenium

I want to get the reviewer's name, location, time of posting, review title, and the full review text from this web page (http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061).

My code:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    firefox_capabilities = DesiredCapabilities.FIREFOX
    firefox_capabilities['marionette'] = True
    firefox_capabilities['binary'] = '/etc/firefox'

    driver = webdriver.Firefox(capabilities=firefox_capabilities)
    driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
    soup = BeautifulSoup(driver.page_source, "lxml")
    for link in soup.select(".profile"):
        try:
            profile = link.select("p:nth-of-type(1) a")[0]
            profile1 = link.select("p:nth-of-type(2)")[0]
        except:
            pass
        print(profile.text, profile1.text)

    driver = webdriver.Firefox(capabilities=firefox_capabilities)
    driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
    soup1 = BeautifulSoup(driver.page_source, "lxml")
    for link in soup1.select(".col-10.review"):
        try:
            profile2 = link.select("small:nth-of-type(1)")[0]
            profile3 = link.select("span:nth-of-type(3)")[0]
            profile4 = link.select("a:nth-of-type(1)")[0]
        except:
            pass
        print(profile2.text, profile3.text, profile4.text)

    driver = webdriver.Firefox(capabilities=firefox_capabilities)
    driver.get('http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061')
    soup2 = BeautifulSoup(driver.page_source, "lxml")
    for link in soup2.select(".more.review"):
        try:
            containers = page_soup.findAll("div", {"class": "more reviewdata"})
            count = len(containers)
            for index in range(count):
                count1 = len(containers[index].p)
                for i in range(count1):
                    profile5 = link.select("p:nth-of-type(i)")[0]
        except:
            pass
        print(profile5.text)
    driver.quit()

I get output for the name, location, time, and title of each review, but I am unable to get the full review text. I would be grateful if anyone could help me get that output and also optimize my code, i.e. extract the required data while loading the web page only once. It would also be very helpful if someone could show me how to extract all the customer reviews of Jio from every page of the website.

Upvotes: 1

Views: 909

Answers (1)

SIM

Reputation: 22440

You can achieve the same with fewer lines of code and less pain. I've extracted three main fields here (name, review_title, and review_data); the remaining fields are easy to add by tweaking the selectors.

Here is one way to do it with Selenium alone:

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061")
    wait = WebDriverWait(driver, 10)

    # Wait until the review blocks are present, then handle them one at a time.
    for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".review-article"))):
        # Click the link inside the review body so the full text gets loaded.
        link = item.find_element(By.CSS_SELECTOR, ".reviewdata a")
        link.click()
        time.sleep(2)  # crude pause while the expanded review loads

        name = item.find_element(By.CSS_SELECTOR, "p a").text
        review_title = item.find_element(By.CSS_SELECTOR, "strong a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews]").text
        # Collapse runs of whitespace so the review prints on a single line.
        review_data = ' '.join(' '.join(elem.text.split()) for elem in item.find_elements(By.CSS_SELECTOR, ".reviewdata"))
        print("Name: {}\nReview_Title: {}\nReview_Data: {}\n".format(name, review_title, review_data))

    driver.quit()
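
The `' '.join(... .split())` idiom used for `review_data` is easy to misread, so here is a minimal standalone sketch of what it does. The helper name `normalize_whitespace` and the sample strings are made up for illustration; in the scraper the strings would come from the `.text` of each matched element.

```python
def normalize_whitespace(chunks):
    """Collapse whitespace runs inside each chunk, then join the chunks with single spaces."""
    return " ".join(" ".join(chunk.split()) for chunk in chunks)


# Stand-in strings imitating text scraped from several elements (hypothetical data).
sample = ["Great   network\n\n speed", "  but poor\tsupport  "]
print(normalize_whitespace(sample))
```

Calling `str.split()` with no arguments splits on any run of whitespace and drops leading and trailing blanks, which is why newlines and tabs inside a review disappear.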

Or do the same with the two combined (Selenium + bs4):

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061")
    wait = WebDriverWait(driver, 10)

    # First pass: use Selenium only to expand every review on the page.
    for article in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".review-article"))):
        link = article.find_element(By.CSS_SELECTOR, ".reviewdata a")
        link.click()
        time.sleep(2)  # crude pause while the expanded review loads

    # Second pass: parse the fully expanded page once with BeautifulSoup.
    soup = BeautifulSoup(driver.page_source, "lxml")
    for item in soup.select(".review-article"):
        name = item.select("p a")[0].text
        review_title = item.select("strong a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews]")[0].text
        # Collapse runs of whitespace so the review prints on a single line.
        review_data = ' '.join(' '.join(elem.text.split()) for elem in item.select(".reviewdata"))
        print("Name: {}\nReview_Title: {}\nReview_Data: {}\n".format(name, review_title, review_data))

    driver.quit()
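
To collect every field in one pass over a single page load, as the question asks, a rough sketch along these lines may help. It runs on a small hand-written HTML sample that only imitates the class names mentioned in this thread, so the nesting, the `.profile` selectors, and the helper names `text_or_none` and `parse_reviews` are all assumptions to check against the real page. It uses `select_one`, which returns `None` when nothing matches, in place of the bare `try`/`except` blocks.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: it imitates the class names used in this thread,
# but the real page's structure may well differ.
SAMPLE_HTML = """
<div class="review-article">
  <div class="profile"><p><a href="#">alice</a></p><p>Mumbai</p></div>
  <strong><a id="ctl00_title_0" href="#">Good network</a></strong>
  <div class="more reviewdata"><p>Fast   data.</p><p>Cheap plans.</p></div>
</div>
<div class="review-article">
  <div class="profile"><p><a href="#">bob</a></p></div>
  <div class="more reviewdata"><p>No signal at home.</p></div>
</div>
"""


def text_or_none(node, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    found = node.select_one(selector)
    return found.get_text(strip=True) if found else None


def parse_reviews(html):
    """Parse the HTML once and return one dict per review block."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for article in soup.select(".review-article"):
        reviews.append({
            "name": text_or_none(article, ".profile p:nth-of-type(1) a"),
            "location": text_or_none(article, ".profile p:nth-of-type(2)"),
            "title": text_or_none(article, "strong a"),
            # Join all paragraphs of the review body, collapsing whitespace.
            "review": " ".join(
                " ".join(p.get_text().split())
                for p in article.select(".reviewdata p")
            ),
        })
    return reviews


for row in parse_reviews(SAMPLE_HTML):
    print(row)
```

In the scraper itself, `parse_reviews(driver.page_source)` would be called once, after the reviews have been expanded, so the browser only has to load the page a single time.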

Upvotes: 1
