Reputation: 145
I am doing some web scraping on the Rotten Tomatoes website, for example here.
I am using Python with the Beautiful Soup and lxml modules together.
I want to extract the movie info, for example:
Genre: Drama, Musical & Performing Arts
Directed By: Kirill Serebrennikov
Written By: Mikhail Idov, Lili Idova, Ivan Kapitonov, Kirill Serebrennikov, Natalya Naumenko
Written by (links): /celebrity/michael_idov, /celebrity/lily_idova, /celebrity/ivan_kapitonov, /celebrity/kirill_serebrennikov, /celebrity/natalya_naumenko
I inspected the page HTML to work out the paths:
<li class="meta-row clearfix">
<div class="meta-label subtle">Rating: </div>
<div class="meta-value">NR</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Genre: </div>
<div class="meta-value">
<a href="/browse/opening/?genres=9">Drama</a>,
<a href="/browse/opening/?genres=12">Musical & Performing Arts</a>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Directed By: </div>
<div class="meta-value">
<a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Written By: </div>
<div class="meta-value">
<a href="/celebrity/michael_idov">Mikhail Idov</a>,
<a href="/celebrity/lily_idova">Lili Idova</a>,
<a href="/celebrity/ivan_kapitonov">Ivan Kapitonov</a>,
<a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>,
<a href="/celebrity/natalya_naumenko">Natalya Naumenko</a>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">In Theaters: </div>
<div class="meta-value">
<time datetime="2019-06-06T17:00:00-07:00">Jun 7, 2019</time>
<span style="text-transform:capitalize"> limited</span>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Runtime: </div>
<div class="meta-value">
<time datetime="P126M">
126 minutes
</time>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Studio: </div>
<div class="meta-value">
<a href="http://sonypictures.ru/leto/" target="movie-studio">Gunpowder & Sky</a>
</div>
</li>
I created the parsed objects like this:
import requests
from bs4 import BeautifulSoup
from lxml import html
page_response = requests.get(url, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
tree = html.fromstring(page_response.content)
For the writers, for example, I only need the element's text, which is fairly easy to get:
page_content.select('div.meta-value')[3].getText()
Or, using XPath, for the Rating:
tree.xpath('//div[@class="meta-value"]/text()')[0]
The writer links are where I have the issue. To access the HTML chunk, I do this:
page_content.select('div.meta-value')[3]
Which gives:
<div class="meta-value">
<a href="/celebrity/michael_idov">Mikhail Idov</a>,
<a href="/celebrity/lily_idova">Lili Idova</a>,
<a href="/celebrity/ivan_kapitonov">Ivan Kapitonov</a>,
<a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>,
<a href="/celebrity/natalya_naumenko">Natalya Naumenko</a>
</div>
Or:
tree.xpath('//div[@class="meta-value"]')[3]
Giving:
<Element div at 0x2915a4c54a8>
The problem is that I can't extract the 'href'. The output I want is:
/celebrity/michael_idov, /celebrity/lily_idova, /celebrity/ivan_kapitonov, /celebrity/kirill_serebrennikov, /celebrity/natalya_naumenko
I have tried:
page_content.select('div.meta-value')[3].get('href')
tree.xpath('//div[@class="meta-value"]')[3].get('href')
tree.xpath('//div[@class="meta-value"]/@href')[3]
All of them return None or raise an error. Could anyone help me out with this?
Thanks in advance! Cheers!
Upvotes: 1
Views: 5224
Reputation: 22440
Try the following scripts to get the content you are interested in. Both target the rows by their label text rather than by hardcoded indices, so they should produce the desired output, but make sure to test them against different movies.
Using a CSS selector:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.rottentomatoes.com/m/leto')
soup = BeautifulSoup(r.text, 'lxml')
# Note: newer soupsieve releases deprecate :contains() in favor of :-soup-contains().
directed = soup.select_one(".meta-row:contains('Directed By') > .meta-value > a").text
# The href attribute lives on the <a> children, not on the div.meta-value itself.
written_items = soup.select(".meta-row:contains('Written By') > .meta-value > a")
written = [item.text for item in written_items]
written_links = [item.get("href") for item in written_items]
print(directed,written,written_links)
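For reference, the attempts in the question come back empty because `href` is an attribute of the `<a>` children, not of the `div.meta-value` itself. Here is a minimal offline sketch of the same idea that avoids both hardcoded indices and the `:contains` pseudo-class by comparing the label text directly. It runs against a trimmed copy of the markup from the question rather than the live page, and `links_for` is a hypothetical helper name, not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question (no live request).
SAMPLE = """
<ul>
<li class="meta-row clearfix">
<div class="meta-label subtle">Directed By: </div>
<div class="meta-value">
<a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>
</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Written By: </div>
<div class="meta-value">
<a href="/celebrity/michael_idov">Mikhail Idov</a>,
<a href="/celebrity/lily_idova">Lili Idova</a>
</div>
</li>
</ul>
"""


def links_for(soup, label):
    """Hypothetical helper: hrefs of the <a> tags in the row whose label matches."""
    for row in soup.select("li.meta-row"):
        label_div = row.select_one("div.meta-label")
        # "Written By: " becomes "Written By" after stripping and dropping the colon.
        if label_div and label_div.get_text(strip=True).rstrip(":") == label:
            return [a.get("href") for a in row.select("div.meta-value a")]
    return []


soup = BeautifulSoup(SAMPLE, "html.parser")
written_div = soup.select("div.meta-value")[1]
print(written_div.get("href"))        # None: the div carries no href attribute
print(links_for(soup, "Written By"))  # hrefs collected from the <a> children
```

Whether this is preferable to the selector version above is a matter of taste; it mainly trades a compact selector for an explicit loop that does not depend on the soupsieve version installed.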
Using XPath:
import requests
from lxml.html import fromstring
r = requests.get('https://www.rottentomatoes.com/m/leto')
root = fromstring(r.text)
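# Find elements whose text contains the label, step up to the parent, then pick its meta-value child.
directed = root.xpath("//*[contains(.,'Directed By')]/parent::*/*[@class='meta-value']/a/text()")
written = root.xpath("//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a/text()")
# @href is read from the <a> elements, which is where the attribute lives.
written_links = root.xpath("//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a/@href")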
print(directed,written,written_links)
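The same point can be checked offline with lxml. The sketch below is only an illustration under assumptions: it uses a trimmed copy of the markup from the question instead of the live page, and it swaps in a stricter expression (exact label match plus `following-sibling`) rather than the `contains(...)/parent::*` form above. It first shows that asking the `div` for `@href` matches nothing, then reads the attribute from the `<a>` children:

```python
from lxml import html

# Trimmed copy of the markup quoted in the question (no live request).
SAMPLE = """
<ul>
<li class="meta-row clearfix">
<div class="meta-label subtle">Rating: </div>
<div class="meta-value">NR</div>
</li>
<li class="meta-row clearfix">
<div class="meta-label subtle">Written By: </div>
<div class="meta-value">
<a href="/celebrity/michael_idov">Mikhail Idov</a>,
<a href="/celebrity/lily_idova">Lili Idova</a>
</div>
</li>
</ul>
"""

tree = html.fromstring(SAMPLE)

# The div elements have no href attribute, so this node-set is empty;
# indexing it with [3] as in the question would raise IndexError.
div_hrefs = tree.xpath('//div[@class="meta-value"]/@href')
print(div_hrefs)

# Match the label div by its normalized text, move to the sibling value div,
# then read @href from its <a> children.
written_links = tree.xpath(
    '//div[contains(@class, "meta-label")][normalize-space()="Written By:"]'
    '/following-sibling::div[@class="meta-value"]/a/@href'
)
print(", ".join(written_links))
```

The exact-match predicate is stricter than `contains(.,'Written By')`, so it would need adjusting if the site changes the label text or punctuation.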
For the cast, I used a list comprehension so that I can call .strip() on each individual element to remove surrounding whitespace. normalize-space() is the ideal option for this, though.
cast = [item.strip() for item in root.xpath("//*[contains(@class,'cast-item')]//a/span[@title]/text()")]
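As a rough sketch of the normalize-space() alternative mentioned above: in XPath 1.0 the function converts only the first node of a node-set, so one way to apply it to several elements is to evaluate it once per element. The example assumes the Runtime markup from the question rather than the cast markup, which is not shown in this thread:

```python
from lxml import html

# Runtime row from the question; the text node is padded with newlines.
SAMPLE = """
<div class="meta-value">
<time datetime="P126M">
126 minutes
</time>
</div>
"""

tree = html.fromstring(SAMPLE)

# Raw text nodes keep the surrounding whitespace.
raw = tree.xpath("//time/text()")

# Option 1: strip in Python, as in the cast example.
stripped = [text.strip() for text in raw]

# Option 2: let XPath do it, evaluating normalize-space() per element.
normalized = [el.xpath("normalize-space()") for el in tree.xpath("//time")]

print(stripped, normalized)
```

Unlike .strip(), normalize-space() also collapses runs of internal whitespace into single spaces, which may or may not be what you want.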
Upvotes: 2