Reputation: 125
I want to extract the director and actor elements from this parsed HTML output of the IMDb Top 250 page. What would a Python one-liner for this look like? The class "text-muted text-small" appears multiple times, so find_all on its own does not seem to be the best way to go about it.
<span class="ipl-rating-selector__rating-value">0</span>
</div>
<div class="ipl-rating-selector__error ipl-rating-selector__wrapper">
<span>Error: please try again.</span>
</div>
</div>
<div class="ipl-rating-interactive__loader">
<img alt="loading" src="https://m.media-amazon.com/images/G/01/IMDb/spinning-progress.gif"/>
</div>
</div>
</div>
<div class="inline-block ratings-metascore">
<span class="metascore favorable">80 </span>
Metascore
</div>
<p class="">
Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.</p>
<p class="text-muted text-small">
Director:
<a href="/name/nm0001104/">Frank Darabont</a>
<span class="ghost">|</span>
Stars:
<a href="/name/nm0000209/">Tim Robbins</a>,
<a href="/name/nm0000151/">Morgan Freeman</a>,
<a href="/name/nm0348409/">Bob Gunton</a>,
<a href="/name/nm0006669/">William Sadler</a>
</p>
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span data-value="2187696" name="nv">2,187,696</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="28,341,469" name="nv">$28.34M</span>
</p>
<div class="wtw-option-standalone" data-baseref="wl_li" data-tconst="tt0111161" data-watchtype="minibar"></div>
</div>
Upvotes: 0
Views: 106
Reputation: 166
This will select the containing p tag and iterate over its children, printing out Directors and Actors separately:
director_and_stars_tag = soup.select_one('p:contains("Director:")')
directors_flag = True
for name_tag in director_and_stars_tag.findChildren():
    if directors_flag:
        # These are Director tags
        if name_tag.name == 'span':
            # The "|" separator span marks the end of the directors
            directors_flag = False
        else:
            print('Director: %s' % name_tag.string)
    else:
        # These are Actor tags
        print('Actor: %s' % name_tag.string)
Output:
Director: Frank Darabont
Actor: Tim Robbins
Actor: Morgan Freeman
Actor: Bob Gunton
Actor: William Sadler
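The same split can also be done without a flag, by locating the "|" separator span and asking for the links on either side of it. This is only a sketch: the html string is a trimmed copy of the markup from the question, and the variable names (credits_p, separator, directors, stars) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, for illustration only.
html = """
<p class="text-muted text-small">
Director:
<a href="/name/nm0001104/">Frank Darabont</a>
<span class="ghost">|</span>
Stars:
<a href="/name/nm0000209/">Tim Robbins</a>,
<a href="/name/nm0000151/">Morgan Freeman</a>,
<a href="/name/nm0348409/">Bob Gunton</a>,
<a href="/name/nm0006669/">William Sadler</a>
</p>
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span data-value="2187696" name="nv">2,187,696</span>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Pick the first paragraph whose text mentions "Director".
credits_p = next(
    p for p in soup.find_all("p", class_="text-muted")
    if "Director" in p.get_text()
)

# The "|" span sits between the director links and the star links.
separator = credits_p.find("span", class_="ghost")

# find_previous_siblings walks backwards, so reverse to restore page order.
directors = [a.get_text(strip=True)
             for a in separator.find_previous_siblings("a")][::-1]
stars = [a.get_text(strip=True)
         for a in separator.find_next_siblings("a")]

print(directors)
print(stars)
```

This relies on the separator span being present; if a paragraph had no "|" span, separator would be None and the sibling lookups would fail.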
Upvotes: 1
Reputation: 1726
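If you are using BeautifulSoup 4.7.0 or higher, you can use the :contains CSS selector. A comma-separated list matches a p tag whose text contains any one of the strings:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html, "html.parser")
soup.select_one('p:contains("Director:","Stars:")')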
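To get from the selected paragraph to the names themselves, the selector can be extended to reach the links directly. The sketch below assumes a Soup Sieve release that supports the :-soup-contains spelling (as far as I know, 2.1 or later, where plain :contains is deprecated); on older releases, substitute :contains. The html string is a trimmed copy of the question's markup and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, for illustration only.
html = """
<p class="text-muted text-small">
Director:
<a href="/name/nm0001104/">Frank Darabont</a>
<span class="ghost">|</span>
Stars:
<a href="/name/nm0000209/">Tim Robbins</a>,
<a href="/name/nm0000151/">Morgan Freeman</a>,
<a href="/name/nm0348409/">Bob Gunton</a>,
<a href="/name/nm0006669/">William Sadler</a>
</p>
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span data-value="2187696" name="nv">2,187,696</span>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Selector for the paragraph that holds the credits.
credits_sel = 'p:-soup-contains("Director:")'

# Every link that is a direct child of that paragraph, in page order.
all_names = [a.get_text() for a in soup.select(f"{credits_sel} > a")]

# Links that come after the "|" separator span are the stars.
stars = [a.get_text() for a in soup.select(f"{credits_sel} > span.ghost ~ a")]

# Whatever is left over came before the separator.
directors = [name for name in all_names if name not in stars]

print(directors)
print(stars)
```

The stars line is the closest thing to the one-liner asked for in the question; the directors are derived by elimination, which would misfire if the same person were listed as both director and star.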
Upvotes: 1
Reputation: 920
If there's no id or class that you can use to identify those specific elements, you can simply iterate through the candidate elements and check whether they contain what you're looking for.
A working example on your HTML sample would be:
details = soup.find_all("p", attrs={"class": "text-muted text-small"})
for element in details:
    # Only the credits paragraph mentions "Stars"; the votes paragraph does not
    if "Stars" in element.text:
        stars = element.find_all("a")
        for star in stars:
            print(star.text)
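Note that element.find_all("a") returns every link in the paragraph, so the director is printed along with the stars. A possible extension of the same loop, sketched below, tracks the most recent text label while walking the paragraph's children and files each link under it, which also carries over to a page with many movies. The parse_credits helper is hypothetical, the third paragraph uses placeholder names rather than real page data, and the "Directors:" plural label is an assumption about how the page marks multiple directors.

```python
from bs4 import BeautifulSoup, NavigableString

# First two paragraphs are trimmed from the question's markup.
# The third is a made-up placeholder entry, not real IMDb data.
html = """
<p class="text-muted text-small">
Director:
<a href="/name/nm0001104/">Frank Darabont</a>
<span class="ghost">|</span>
Stars:
<a href="/name/nm0000209/">Tim Robbins</a>,
<a href="/name/nm0000151/">Morgan Freeman</a>
</p>
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span data-value="2187696" name="nv">2,187,696</span>
</p>
<p class="text-muted text-small">
Directors:
<a href="#">Placeholder Director A</a>,
<a href="#">Placeholder Director B</a>
<span class="ghost">|</span>
Stars:
<a href="#">Placeholder Star</a>
</p>
"""


def parse_credits(p_tag):
    """Group the links in a credits paragraph by the label preceding them."""
    credits = {"directors": [], "stars": []}
    current = None
    for child in p_tag.children:
        if isinstance(child, NavigableString):
            # Bare text nodes carry the "Director:" / "Stars:" labels.
            text = child.strip()
            if text.startswith("Director"):
                current = "directors"
            elif text.startswith("Star"):
                current = "stars"
        elif child.name == "a" and current is not None:
            credits[current].append(child.get_text(strip=True))
    return credits


soup = BeautifulSoup(html, "html.parser")
all_credits = [
    parse_credits(p)
    for p in soup.find_all("p", attrs={"class": "text-muted text-small"})
    if "Stars" in p.text  # skips the votes paragraph
]

for entry in all_credits:
    print(entry)
```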
Upvotes: 0