Reputation: 501
The HTML code below is from a movie review website. I want to extract the Stars from it, which would be John C. Reilly, Sarah Silverman and Gal Gadot. How could I do this?
Code:
html_doc = """
<html>
<head>
</head>
<body>
<div class="credit_summary_item">
<h4 class="inline">Stars:</h4>
<a href="/name/nm0000604/?ref_=tt_ov_st_sm">John C. Reilly</a>,
<a href="/name/nm0798971/?ref_=tt_ov_st_sm">Sarah Silverman</a>,
<a href="/name/nm2933757/?ref_=tt_ov_st_sm">Gal Gadot</a>
<span class="ghost">|</span>
<a href="fullcredits/?ref_=tt_ov_st_sm">See full cast & crew</a> »
</div>
</body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
I was going to use for loops to iterate through each div until I found the one with the text Stars, from which I could then extract the names. But I don't know how I would code this, as I am not too familiar with HTML syntax or the module.
Upvotes: 2
Views: 141
Reputation: 71451
You can iterate over all a tags in the credit_summary_item div:
from bs4 import BeautifulSoup as soup
*results, _ = [i.text for i in soup(html_doc, 'html.parser').find('div', {'class':'credit_summary_item'}).find_all('a')]
Output:
['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
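Edit: if the page contains several credit_summary_item divs, first select the one whose text contains Stars: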
_d = [i for i in soup(html_doc, 'html.parser').find_all('div', {'class':'credit_summary_item'}) if 'Stars:' in i.text][0]
*results, _ = [i.text for i in _d.find_all('a')]
Output:
['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
Upvotes: 1
Reputation: 79
You can also use a regex to match the href of the name links (soup is the object created in the question):
import re
stars = soup.findAll('a', href=re.compile('/name/nm.+'))
names = [x.text for x in stars]
names
# output: ['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
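One caveat: the pattern matches every name link in the document, not only those in the Stars block. The sketch below illustrates this under an assumption: the first div (with "Example Director" and its href) is made up for the demonstration and is not part of the question's HTML. It then narrows the search to the div whose h4 reads "Stars:".

```python
import re
from bs4 import BeautifulSoup

# The Director div is a hypothetical addition for illustration only.
html_doc = """
<div class="credit_summary_item">
<h4 class="inline">Director:</h4>
<a href="/name/nm0000001/">Example Director</a>
</div>
<div class="credit_summary_item">
<h4 class="inline">Stars:</h4>
<a href="/name/nm0000604/?ref_=tt_ov_st_sm">John C. Reilly</a>,
<a href="/name/nm0798971/?ref_=tt_ov_st_sm">Sarah Silverman</a>,
<a href="/name/nm2933757/?ref_=tt_ov_st_sm">Gal Gadot</a>
<span class="ghost">|</span>
<a href="fullcredits/?ref_=tt_ov_st_sm">See full cast & crew</a>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Anchored pattern: the href must start with /name/nm followed by digits.
name_link = re.compile(r"^/name/nm\d+")

# Searching the whole document also picks up the director.
all_people = [a.get_text() for a in soup.find_all("a", href=name_link)]

# Restrict the search to the div that holds the "Stars:" heading.
stars_div = soup.find("h4", string="Stars:").parent
stars_only = [a.get_text() for a in stars_div.find_all("a", href=name_link)]

print(all_people)
print(stars_only)
```

If the page has no "Stars:" heading, `soup.find` returns None and the `.parent` access fails, so real code would probably want a check there.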
Upvotes: 0
Reputation: 31
I will show how to implement this step by step; you will see that you only need to learn the BeautifulSoup syntax.
First, use the findAll method to get every "div" tag whose "class" attribute is "credit_summary_item":
divs = soup.findAll("div", attrs={"class": "credit_summary_item"})
Then, keep only the divs whose heading contains "Stars:" (skipping any div that has no h4):
stars = [div for div in divs if div.h4 is not None and "Stars:" in div.h4.text]
If there is only one such div, you can take it out of the list:
star = stars[0]
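Then find all the "a" tags inside it and take their text, keeping only the links whose href starts with /name/ so that the trailing "See full cast & crew" link is skipped:
names = [a.text for a in star.findAll("a") if a.get("href", "").startswith("/name/")]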
As you can see, I didn't use any HTML/CSS syntax, only BeautifulSoup. I hope this helps.
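Put together, the steps above might look like the following sketch. The function name `extract_stars` is only illustrative, and the sketch assumes that star links always have an href beginning with /name/, as they do in the question's HTML. It uses find_all, the current spelling of findAll.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="credit_summary_item">
<h4 class="inline">Stars:</h4>
<a href="/name/nm0000604/?ref_=tt_ov_st_sm">John C. Reilly</a>,
<a href="/name/nm0798971/?ref_=tt_ov_st_sm">Sarah Silverman</a>,
<a href="/name/nm2933757/?ref_=tt_ov_st_sm">Gal Gadot</a>
<span class="ghost">|</span>
<a href="fullcredits/?ref_=tt_ov_st_sm">See full cast & crew</a>
</div>
"""

def extract_stars(html):
    """Return the names linked in the div headed 'Stars:', or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Loop over each candidate div, as the question originally intended.
    for div in soup.find_all("div", class_="credit_summary_item"):
        heading = div.find("h4")
        # Skip divs without a heading or with a different heading.
        if heading is None or heading.get_text(strip=True) != "Stars:":
            continue
        # Keep only links to people; this drops "See full cast & crew".
        return [a.get_text(strip=True)
                for a in div.find_all("a", href=True)
                if a["href"].startswith("/name/")]
    return []

stars = extract_stars(html_doc)
print(stars)
```

Returning an empty list when no Stars div exists avoids the IndexError that indexing with [0] would raise on such a page.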
Upvotes: 1