I am not able to scrape the data in Python for the following HTML

Question

I am trying to scrape user reviews from MouthShut.com. When I inspect a review in the browser DevTools, the text I need is inside a div with the class "more reviewdata":

Ipohone 11 Pro X : Looks alike a minion having Three Eyes. yes its Seems as An Alien, But Technically Iphone is Copying features and Function of Androids and Having Custom Os Phones.Triple Camera is Great! for Wide Angle Photography.But The looks of Iphone 11 pro X isn't Good.If ...Read More

I want to extract only the text content of the review. Can anybody help with how to do this, given that there is no unique separator to split on?

I have written the following code:

from requests import get
from bs4 import BeautifulSoup

url = 'https://www.mouthshut.com/mobile-phones/Apple-iPhone-11-Pro-Max-reviews-925993567'
response = get(url)

print(response.text[:100])

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
reviews = html_soup.find_all('div', class_='more reviewdata')

print(type(reviews))
print(len(reviews))

first_review = reviews[2]
first_review.div

Andrej Kesely · Accepted Answer

To scrape all reviews from the page, you can use this example. Longer reviews are cut off on the page (the "...Read More" link), so their full text is fetched separately with a POST request:

import re
import requests
from textwrap import wrap
from bs4 import BeautifulSoup

base_url = 'https://www.mouthshut.com/mobile-phones/Apple-iPhone-11-Pro-Max-reviews-925993567'

# form data for the POST request that returns the full text of one review
data = {
    'type': 'review',
    'reviewid': -1,
    'corp': 'false',
    'catname': ''
}

more_url = 'https://www.mouthshut.com/review/CorporateResponse.ashx'

output = []
with requests.session() as s:
    soup = BeautifulSoup(s.get(base_url).text, 'html.parser')
    for review in soup.select('.reviewdata'):
        # some reviews carry a link whose onclick starts with bindreviewcontent;
        # their full text is requested separately
        a = review.select_one('a[onclick^="bindreviewcontent"]')
        if a:
            # pull the numeric review id out of the onclick handler
            data['reviewid'] = re.findall(r"bindreviewcontent\('(\d+)", a['onclick'])[0]
            comment = BeautifulSoup(s.post(more_url, data=data).text, 'html.parser')
            # drop the first <div> and <ul> of the response so they are left out of the text
            comment.div.extract()
            comment.ul.extract()
            output.append(comment.get_text(separator=' ', strip=True))
        else:
            # no such link: drop the first nested <div>, then take the remaining text
            review.div.extract()
            output.append(review.get_text(separator=' ', strip=True))

# print each review wrapped to textwrap's default width of 70 characters
for i, review in enumerate(output, 1):
    print('--- Review no.{} ---'.format(i))
    print(*wrap(review), sep='\n')
    print()

Prints:

--- Review no.1 ---
As you all know Apple products are too expensive this one is damn one
but who needs to sell his kidney to buy its look is not that much ease
than expected. For me it's 2 star phone

--- Review no.2 ---
Very disappointing product.nothing has changed in operating system,
only camera look has changed which is very odd looking.Device weight
is not light and dont fit in one hand.

--- Review no.3 ---
Ipohone 11 Pro X : Looks alike a minion having Three Eyes. yes its
Seems as An Alien, But Technically Iphone is Copying features and
Function of Androids and Having Custom Os Phones. Triple Camera is
Great! for Wide Angle Photography. But The looks of Iphone 11 pro X
isn't Good. If You Have 3 Kidneys, Then You Can Waste one of them to

... and so on.
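
If you want to check the review-id step without hitting the site, here is a minimal offline sketch of it. The sample onclick string is made up for illustration (the real attribute on MouthShut may carry different arguments), and extract_review_id and build_payload are hypothetical helper names, not part of the code above. It uses re.search instead of re.findall(...)[0] so that a non-matching attribute gives None rather than an IndexError:

```python
import re

# Hypothetical onclick value, modelled on the pattern matched in the answer;
# the actual arguments used by the site are an assumption.
sample_onclick = "bindreviewcontent('2912345',this,false);"

# Same pattern as in the answer: capture the digits right after the opening quote.
REVIEW_ID_RE = re.compile(r"bindreviewcontent\('(\d+)")


def extract_review_id(onclick):
    """Return the review id as a string, or None if the pattern is absent."""
    match = REVIEW_ID_RE.search(onclick)
    return match.group(1) if match else None


def build_payload(review_id):
    """Build the form data posted to CorporateResponse.ashx in the answer."""
    return {
        'type': 'review',
        'reviewid': review_id,
        'corp': 'false',
        'catname': '',
    }


review_id = extract_review_id(sample_onclick)
print(review_id)
print(build_payload(review_id))
print(extract_review_id("return false;"))
```

Building a fresh dict per review, instead of updating one shared data dict, is only a style choice here; the request sent is the same.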
