Newbie101

Reputation: 501

BeautifulSoup how to use for loops and extract specific data?

The HTML code below is from a website regarding movie reviews. I want to extract the Stars from the code below, which would be John C. Reilly, Sarah Silverman and Gal Gadot. How could I do this?

Code:

html_doc = """
<html>
    <head>
    </head>
    <body>
    <div class="credit_summary_item">
                <h4 class="inline">Stars:</h4>
            <a href="/name/nm0000604/?ref_=tt_ov_st_sm">John C. Reilly</a>,
            <a href="/name/nm0798971/?ref_=tt_ov_st_sm">Sarah Silverman</a>,
            <a href="/name/nm2933757/?ref_=tt_ov_st_sm">Gal Gadot</a>
            <span class="ghost">|</span>
            <a href="fullcredits/?ref_=tt_ov_st_sm">See full cast & crew</a>&nbsp;&raquo;
        </div>
    </body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

My idea

I was going to use for loops to iterate through each div until I found the one containing the text Stars, and then extract the names from it. But I don't know how I would code this, as I am not too familiar with HTML syntax or with the module.

Upvotes: 2

Views: 141

Answers (3)

Ajax1234

Reputation: 71451

You can iterate over all a tags in the credit_summary_item div:

from bs4 import BeautifulSoup as soup
*results, _ = [i.text for i in soup(html_doc, 'html.parser').find('div', {'class':'credit_summary_item'}).find_all('a')]

Output:

['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']

Edit:

_d = [i for i in soup(html_doc, 'html.parser').find_all('div', {'class':'credit_summary_item'}) if 'Stars:' in i.text][0]
*results, _ = [i.text for i in _d.find_all('a')]

Output:

['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
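Since the question asks about for loops, here is a rough sketch of the same idea written as explicit loops. The HTML is a trimmed copy of the snippet from the question, and dropping the last link assumes it is always the "See full cast & crew" link, which may not hold on every page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question
html_doc = """
<div class="credit_summary_item">
    <h4 class="inline">Stars:</h4>
    <a href="/name/nm0000604/?ref_=tt_ov_st_sm">John C. Reilly</a>,
    <a href="/name/nm0798971/?ref_=tt_ov_st_sm">Sarah Silverman</a>,
    <a href="/name/nm2933757/?ref_=tt_ov_st_sm">Gal Gadot</a>
    <span class="ghost">|</span>
    <a href="fullcredits/?ref_=tt_ov_st_sm">See full cast &amp; crew</a>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

names = []
for div in soup.find_all("div", class_="credit_summary_item"):
    if "Stars:" not in div.get_text():
        continue  # not the block we want, try the next div
    for link in div.find_all("a"):
        names.append(link.get_text(strip=True))
    break  # stop at the first div that mentions "Stars:"

# Assumption: the last link is the "See full cast & crew" link, so drop it
# (this mirrors the `*results, _` unpacking used above).
names = names[:-1]
print(names)
```

The outer loop is what the question describes: walk through the divs until one contains "Stars:", then collect the link texts inside it.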

Upvotes: 1

Rasko Vučinić

Reputation: 79

You can also use a regular expression to match the href of the name links:

import re

stars = soup.findAll('a', href=re.compile('/name/nm.+'))
names = [x.text for x in stars]
names

# output: ['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
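One caveat: a search over the whole document returns every link whose href matches, not just the stars. The sketch below illustrates this with a "Director:" block that is invented for the example (the name and the id nm9999999 are hypothetical), and then restricts the regex search to the div whose heading says "Stars:".

```python
import re
from bs4 import BeautifulSoup

# The "Director:" div is hypothetical; only the "Stars:" div comes from the question.
html_doc = """
<div class="credit_summary_item">
    <h4 class="inline">Director:</h4>
    <a href="/name/nm9999999/">Example Director</a>
</div>
<div class="credit_summary_item">
    <h4 class="inline">Stars:</h4>
    <a href="/name/nm0000604/?ref_=tt_ov_st_sm">John C. Reilly</a>,
    <a href="/name/nm0798971/?ref_=tt_ov_st_sm">Sarah Silverman</a>,
    <a href="/name/nm2933757/?ref_=tt_ov_st_sm">Gal Gadot</a>
    <span class="ghost">|</span>
    <a href="fullcredits/?ref_=tt_ov_st_sm">See full cast &amp; crew</a>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Anchored pattern: href must start with /name/nm followed by digits
name_link = re.compile(r"^/name/nm\d+")

# Searching the whole document also picks up the director link
all_names = [a.text for a in soup.find_all("a", href=name_link)]

# Narrow the search to the div whose <h4> heading contains "Stars:"
stars_div = next(
    div
    for div in soup.find_all("div", class_="credit_summary_item")
    if div.h4 and "Stars:" in div.h4.text
)
star_names = [a.text for a in stars_div.find_all("a", href=name_link)]

print(all_names)
print(star_names)
```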

Upvotes: 0

TGH

Reputation: 31

I will show how to implement this; you will see that you only need to learn a little BeautifulSoup syntax.

First, use the findAll method to get every "div" tag whose "class" attribute is "credit_summary_item":

divs = soup.findAll("div", attrs={"class": "credit_summary_item"})

Then, filter out the divs that do not contain stars:

stars = [div for div in divs if "Stars:" in div.h4.text]

If there is only one block with stars, you can take it out of the list:

star = stars[0]

Then, find the text of every "a" tag inside it:

names = [a.text for a in star.findAll("a")]

You can see that I didn't use any HTML/CSS syntax, only soup. I hope it helps.
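Put together, the steps above might look like the sketch below. Note that collecting every "a" tag in the div also returns the trailing "See full cast & crew" link, so this version keeps only links whose href starts with "/name/". That prefix is an assumption based on the sample HTML, and extract_stars is just a helper name chosen for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question
html_doc = """
<div class="credit_summary_item">
    <h4 class="inline">Stars:</h4>
    <a href="/name/nm0000604/?ref_=tt_ov_st_sm">John C. Reilly</a>,
    <a href="/name/nm0798971/?ref_=tt_ov_st_sm">Sarah Silverman</a>,
    <a href="/name/nm2933757/?ref_=tt_ov_st_sm">Gal Gadot</a>
    <span class="ghost">|</span>
    <a href="fullcredits/?ref_=tt_ov_st_sm">See full cast &amp; crew</a>
</div>
"""


def extract_stars(html):
    """Return the names linked in the first 'Stars:' credit block, or []."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: all divs with the credit_summary_item class
    divs = soup.find_all("div", attrs={"class": "credit_summary_item"})

    # Step 2: keep only divs whose <h4> heading mentions "Stars:"
    stars = [div for div in divs if div.h4 and "Stars:" in div.h4.text]
    if not stars:
        return []  # no such block on this page

    # Step 3: take the first match and read its links
    links = stars[0].find_all("a")

    # Assumption: person links start with "/name/"; this skips "See full cast & crew"
    return [a.text for a in links if a.get("href", "").startswith("/name/")]


names = extract_stars(html_doc)
print(names)
```

Returning an empty list when no "Stars:" block is found avoids the IndexError that stars[0] would raise on a page without one.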

Upvotes: 1
