Get href from html with Beautiful Soup select or lxml xpath

Question

I am doing some web scraping on the Rotten Tomatoes website, for example here.

I am using Python with the Beautiful Soup and lxml modules together.

I want to extract the movie info, for example:

- Genre: Drama, Musical & Performing Arts
- Directed By: Kirill Serebrennikov
- Written By: Mikhail Idov, Lili Idova, Ivan Kapitonov, Kirill Serebrennikov, Natalya Naumenko
- Written By (links): /celebrity/michael_idov, /celebrity/lily_idova, /celebrity/ivan_kapitonov, /celebrity/kirill_serebrennikov, /celebrity/natalya_naumenko

I inspected the page HTML to work out the paths:

                    
                        Rating: 
                        NR
                    


                    
                        Genre: 
                        

                                Drama, 

                                Musical & Performing Arts

                        
                    


                    
                        Directed By: 
                        

                                Kirill Serebrennikov

                        
                    


                    
                        Written By: 
                        

                                Mikhail Idov, 

                                Lili Idova, 

                                Ivan Kapitonov, 

                                Kirill Serebrennikov, 

                                Natalya Naumenko

                        
                    


                    
                        In Theaters: 
                        
                            Jun 7, 2019
                             limited
                        
                    




                    
                        Runtime: 
                        
                            
                                126 minutes
                            
                        
                    


                    
                    Studio: 
                    

                            Gunpowder & Sky

I created the html objects like this:

    page_response = requests.get(url, timeout=5)
    page_content = BeautifulSoup(page_response.content, "html.parser")
    tree = html.fromstring(page_response.content)

For the writers, for example, since I only need the element's text, it is fairly easy to get:

page_content.select('div.meta-value')[3].getText()

Or, using XPath, for the Rating:

tree.xpath('//div[@class="meta-value"]/text()')[0]

The writer links are where I have the issue. To access the HTML chunk, I do this:

page_content.select('div.meta-value')[3]

Which gives:


    Mikhail Idov, Lili Idova, Ivan Kapitonov, Kirill Serebrennikov, Natalya Naumenko

Or:

tree.xpath('//div[@class="meta-value"]')[3]

Giving:

The problem is that I can't extract the href attributes. The output I want is:

/celebrity/michael_idov, /celebrity/lily_idova, /celebrity/ivan_kapitonov, /celebrity/kirill_serebrennikov, /celebrity/natalya_naumenko

I have tried:

page_content.select('div.meta-value')[3].get('href')
tree.xpath('//div[@class="meta-value"]')[3].get('href')
tree.xpath('//div[@class="meta-value"]/@href')[3]

All of these return None or raise an error. Could anyone help me out with this?

Thanks in advance! Cheers!

SIM · Accepted Answer

Try the following scripts to get the content you are interested in. Make sure to test both of them against several different movies; I expect both to produce the desired output. I avoided hardcoded indices when targeting the content.

Using a CSS selector:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.rottentomatoes.com/m/leto')
soup = BeautifulSoup(r.text,'lxml')

directed = soup.select_one(".meta-row:contains('Directed By') > .meta-value > a").text
written = [item.text for item in soup.select(".meta-row:contains('Written By') > .meta-value > a")]
written_links = [item.get("href") for item in soup.select(".meta-row:contains('Written By') > .meta-value > a")]
print(directed,written,written_links)
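
The attempts in the question return nothing because href is an attribute of each a element inside div.meta-value, not of the div itself. Below is a minimal sketch of that point on a static snippet. The markup (including the li and meta-label names) and the helper links_for are assumptions made for illustration, and the live page may differ. Note also that recent Soup Sieve releases deprecate :contains() in favor of :-soup-contains(), so this sketch matches the label in Python instead, which should behave the same across versions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question; the real page may differ.
SAMPLE_HTML = """
<ul>
  <li class="meta-row">
    <div class="meta-label">Directed By: </div>
    <div class="meta-value">
      <a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>
    </div>
  </li>
  <li class="meta-row">
    <div class="meta-label">Written By: </div>
    <div class="meta-value">
      <a href="/celebrity/michael_idov">Mikhail Idov</a>,
      <a href="/celebrity/lily_idova">Lili Idova</a>
    </div>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The div carries no href attribute of its own, so .get() returns None.
div_href = soup.select("div.meta-value")[1].get("href")


def links_for(label):
    """Return the hrefs in the first meta-row whose text mentions `label`."""
    for row in soup.select(".meta-row"):
        if label in row.get_text():
            # Read href from each <a> directly under the value div.
            return [a["href"] for a in row.select(".meta-value > a")]
    return []


written_links = links_for("Written By")
print(div_href, written_links)
```

Matching on the label text rather than on a position such as [3] keeps the lookup working when a movie page omits or reorders a row.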

Using XPath:

import requests
from lxml.html import fromstring

r = requests.get('https://www.rottentomatoes.com/m/leto')
root = fromstring(r.text)

directed = root.xpath("//*[contains(.,'Directed By')]/parent::*/*[@class='meta-value']/a/text()")
written = root.xpath("//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a/text()")
written_links = root.xpath("//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a/@href")
print(directed,written,written_links)
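
The same idea in XPath, as a rough sketch on a static snippet: @href has to be selected on the a step, which is why //div[@class="meta-value"]/@href in the question comes back empty. The markup below is an assumption made for illustration (the li and meta-label names in particular), not a copy of the live page.

```python
from lxml.html import fromstring

# Hypothetical markup modeled on the question; the real page may differ.
SAMPLE_HTML = """
<ul>
  <li class="meta-row">
    <div class="meta-label">Directed By: </div>
    <div class="meta-value">
      <a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>
    </div>
  </li>
  <li class="meta-row">
    <div class="meta-label">Written By: </div>
    <div class="meta-value">
      <a href="/celebrity/michael_idov">Mikhail Idov</a>,
      <a href="/celebrity/lily_idova">Lili Idova</a>
    </div>
  </li>
</ul>
"""

root = fromstring(SAMPLE_HTML)

# The div has no href attribute, so asking for it on the div finds nothing.
on_div = root.xpath('//div[@class="meta-value"]/@href')

# Stepping down to the <a> children exposes the attribute.
all_links = root.xpath('//div[@class="meta-value"]/a/@href')

# Restrict to one row by matching its label text instead of an index.
written_links = root.xpath(
    "//*[contains(@class,'meta-row')][contains(.,'Written By')]"
    "//*[@class='meta-value']/a/@href"
)
print(on_div, all_links, written_links)
```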

For the cast, I used a list comprehension so that I could call .strip() on each element to remove surrounding whitespace. normalize-space() is the ideal option for this, though.

cast = [item.strip() for item in root.xpath("//*[contains(@class,'cast-item')]//a/span[@title]/text()")]
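
A minimal sketch of the normalize-space() variant follows. In XPath 1.0, normalize-space() applied to a node-set only uses the first node, so the function is evaluated once per matched element here. The markup and the names in it are placeholders invented for illustration, not data from the page.

```python
from lxml.html import fromstring

# Hypothetical cast markup with placeholder names; the real page may differ.
SAMPLE_HTML = """
<div>
  <div class="cast-item media">
    <a href="/celebrity/first_actor"><span title="First Actor">
        First   Actor
    </span></a>
  </div>
  <div class="cast-item media">
    <a href="/celebrity/second_actor"><span title="Second Actor">
        Second Actor
    </span></a>
  </div>
</div>
"""

root = fromstring(SAMPLE_HTML)

# Select the span elements themselves rather than their text() nodes.
spans = root.xpath("//*[contains(@class,'cast-item')]//a/span[@title]")

# normalize-space(.) trims the ends and collapses inner whitespace runs.
cast = [span.xpath("normalize-space(.)") for span in spans]
print(cast)
```

Unlike .strip(), this also collapses repeated whitespace inside a name, which can matter when the source HTML wraps text across lines.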
