lxml scraping overwrite error due to missing element

Question

I am trying to scrape user review information from IMDb: the star rating the user gives, the title of the review, and the review text itself. However, I run into a problem when a review has no star rating. From the first review without a rating onward, my code overwrites the star ratings as if no further ratings were given on the page. When a star rating is missing, I just want the phrase "no input" to appear.

Here is my code:

import lxml
from lxml import html
import requests
headers= {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
page = requests.get('https://www.imdb.com/title/tt0108052/reviews?ref_=tt_ql_3', headers=headers)
tree = html.fromstring(page.content)


x=tree.xpath('//div[@class="lister-item-content"]')
for index in range(len(x)):    

    Title='###Title:',(tree.xpath('//a[@class="title"]')[index]).text_content()
    Author='###Author:',(tree.xpath('//span[@class="display-name-link"]')[index]).text_content()
    Text='###Text:', (tree.xpath('//div[@class="text show-more__control"]')[index]).text_content()
    if (tree.xpath('.//div[@class="ipl-ratings-bar"]')[index]) in (tree.xpath('.//div[@class="lister-item-content"]')[index]):
        Stars=(tree.xpath('//div[@class="ipl-ratings-bar"]/span[1]/span[1]')[index]).text_content()
    else:
        Stars=('no input')
    if index <5:
        print([('###Index:', index), Stars, Title])

And this is the current output I get:

[('###Index:', 0), '10', ('###Title:', ' Bring me the head of Hitler n Himmler.\n')]
[('###Index:', 1), 'no input', ('###Title:', ' The most shattering film of all time.\n')]
[('###Index:', 2), 'no input', ('###Title:', " Excellent - Spielberg's Best\n")]
[('###Index:', 3), 'no input', ('###Title:', ' Vehement\n')]
[('###Index:', 4), 'no input', ('###Title:', " don't take this personally\n")]

Index 0 and 1 correctly show "10" and "no input". However, the next three reviews (index 2, 3 and 4) should have the star ratings "9", "10" and "7" respectively. Why are the star ratings overwritten with "no input" after the first missing star rating, even though that is incorrect?

SIM · Accepted Answer

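Rather than indexing into page-wide result lists, try looping over each review container and querying its fields relative to that container. The page-wide list of rating bars has one entry fewer for every review without a rating, so after the first such review the rating bar at a given index belongs to a later review and your membership test fails from then on. Something like the following should solve the issue: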

import requests
from lxml.html import fromstring

link = 'https://www.imdb.com/title/tt0108052/reviews?ref_=tt_ql_3'

page = requests.get(link, headers= {"User-Agent":"Mozilla/5.0"})
tree = fromstring(page.content)
for item in tree.xpath('//div[contains(@class,"imdb-user-review")]'):    
    title = item.xpath('.//a[@class="title"]')[0].text.strip()
    author = item.xpath('.//span[@class="display-name-link"]/a')[0].text.strip()
    text = item.xpath('.//div[starts-with(@class,"text")]')[0].text.strip()
    # Look for the rating inside this review only; the list is empty when no rating was given.
    rating = item.xpath('.//span[@class="rating-other-user-rating"]')
    stars = rating[0].text_content().strip() if rating else "N/A"
    print(f'{title}\n{author}\n{text}\n{stars}\n')
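
To see why the relative lookup helps, here is a minimal offline sketch. It does not use the real IMDb markup: the `SAMPLE` HTML, the class names `review` and `rating`, and the helper `stars_for` are made up for illustration. It uses the standard library's `xml.etree.ElementTree` so it runs without network access or extra packages. lxml elements offer the same `findall` and `findtext` methods, so the pattern should carry over.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: three reviews, the second has no rating.
SAMPLE = """
<div>
  <div class="review"><span class="rating">10</span><a class="title">First</a></div>
  <div class="review"><a class="title">Second</a></div>
  <div class="review"><span class="rating">9</span><a class="title">Third</a></div>
</div>
"""

root = ET.fromstring(SAMPLE)
reviews = root.findall('.//div[@class="review"]')

# Page-wide query: only two ratings for three reviews, so all_ratings[i]
# stops matching reviews[i] after the first review without a rating.
all_ratings = root.findall('.//span[@class="rating"]')
print(len(reviews), len(all_ratings))  # 3 2


def stars_for(review, default="no input"):
    """Return the rating text found inside this one review, or the default."""
    return review.findtext('.//span[@class="rating"]', default=default).strip()


# Relative query: each review is searched on its own, so a missing rating
# only affects that review.
results = [(r.findtext('.//a[@class="title"]'), stars_for(r)) for r in reviews]
for title, stars in results:
    print(title, stars)
# First 10
# Second no input
# Third 9
```

Because every field is looked up inside its own review element, the position of a review in the page no longer matters, and the default value covers the missing case without any index bookkeeping.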
