Reputation: 831
I'm practicing BeautifulSoup by scraping imdb.com, and for a given actor I would like to get the list of all films they starred in as an actor.
So far, for each film I can get something like the following soup:
<div class="filmo-row even" id="actor-tt14677742">
<span class="year_column">2021</span>
<b><a href="/title/tt14677742/">Welcome Back Future</a></b>
(Short)
<br/>
Leo
</div>
As we can see, this film should be filtered out because it is a short. We can also see that the (Short) marker is not wrapped in any tag.
My question:
How can I get this information from the soup? That is, how can I look for text after </b>, if there is any at all?
Upvotes: 2
Views: 1078
Reputation: 831
I don't know much about bs4, but I tried looking at next_sibling, and that solved my problem.
So I do this:
# The node right after the first <b> is the text that holds the category, e.g. "(Short)"
category = movie_soup.find_all('b')[0].next_sibling
unwanted = ('TV', 'Short', 'Series', 'Video', 'Documentary')
if any(word in category for word in unwanted):
    return None, None
If the movie falls into one of the categories I don't need, I return None, None. I know it's not the best code style-wise, but it works for me.
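A self-contained sketch of the same next_sibling idea, run on the sample row from the question. The helper name category_after_title and the use of the built-in html.parser are assumptions made for illustration; the guard covers rows where the node after <b> is missing or is a tag rather than text.

```python
from bs4 import BeautifulSoup, NavigableString

HTML = '''<div class="filmo-row even" id="actor-tt14677742">
<span class="year_column">2021</span>
<b><a href="/title/tt14677742/">Welcome Back Future</a></b>
(Short)
<br/>
Leo
</div>'''

UNWANTED = ('TV', 'Short', 'Series', 'Video', 'Documentary')


def category_after_title(row):
    """Return the stripped text directly after the first <b>, or '' if none."""
    title = row.find('b')
    if title is None:
        return ''
    sibling = title.next_sibling
    # next_sibling can be None or a Tag (such as <br/>) when no text follows
    if isinstance(sibling, NavigableString):
        return sibling.strip()
    return ''


# html.parser is an assumption here; it avoids the extra lxml dependency
row = BeautifulSoup(HTML, 'html.parser').find('div', class_='filmo-row')
category = category_after_title(row)
is_unwanted = any(word in category for word in UNWANTED)
print(category, is_unwanted)
```

For the sample row, category is '(Short)' and is_unwanted is True, so this entry would be skipped.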
Upvotes: 0
Reputation: 11515
I'm not sure exactly what you are looking for, but based on the comments and the other answer, the code below should achieve your goal.
from bs4 import BeautifulSoup
html = '''<div class="filmo-row even" id="actor-tt14677742">
<span class="year_column">2021</span>
<b><a href="/title/tt14677742/">Welcome Back Future</a></b>
(Short)
<br/>
Leo
</div>'''
soup = BeautifulSoup(html, 'lxml')
print(list(soup.select_one('.filmo-row').stripped_strings))
Output:
['2021', 'Welcome Back Future', '(Short)', 'Leo']
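Building on that list, here is a rough sketch of how the category could be separated from the rest for filtering. It assumes that any string wrapped in parentheses is a category marker such as (Short); that rule, the helper name split_row, and the use of html.parser are illustrative assumptions, not something IMDb guarantees.

```python
from bs4 import BeautifulSoup

html = '''<div class="filmo-row even" id="actor-tt14677742">
<span class="year_column">2021</span>
<b><a href="/title/tt14677742/">Welcome Back Future</a></b>
(Short)
<br/>
Leo
</div>'''


def split_row(row):
    """Split a row's stripped strings into (categories, everything else)."""
    categories, rest = [], []
    for text in row.stripped_strings:
        # Assumption: "(...)" strings are category markers like "(Short)"
        if text.startswith('(') and text.endswith(')'):
            categories.append(text)
        else:
            rest.append(text)
    return categories, rest


soup = BeautifulSoup(html, 'html.parser')
categories, rest = split_row(soup.select_one('.filmo-row'))
print(categories)  # ['(Short)']
print(rest)        # ['2021', 'Welcome Back Future', 'Leo']
```

A row with an empty categories list would then be treated as a regular film.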
Upvotes: 2
Reputation: 3987
You can use this:
from bs4 import BeautifulSoup as bs
HTML="""<div class="filmo-row even" id="actor-tt14677742">
<span class="year_column">2021</span>
<b><a href="/title/tt14677742/">Welcome Back Future</a></b>
(Short)
<br/>
Leo
</div>
"""
soup=bs(HTML,"lxml")
print(soup.find("div").find_all(text=True,recursive=False))
# ['\n', '\n', '\n (Short)\n ', '\n Leo\n']
# If you use html5lib as the parser, the result is a bit different:
soup=bs(HTML,"html5lib")
print(soup.find("div").find_all(text=True,recursive=False))
# ['\n ', '\n ', '\n (Short)\n ', '\n Leo\n']
# If you want all of the text from div then try this:
print(soup.find("div").find_all(text=True,recursive=True))
# ['\n', '2021', '\n', 'Welcome Back Future', '\n (Short)\n ', '\n Leo\n']
# Or simply use
print(soup.find("div").text)
"""
2021
Welcome Back Future
(Short)
Leo
"""
I think you can clean it up from here, and I believe "get the list of all films they starred in as an actor" means you also need Leo.
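One possible way to do that cleanup, as a minimal sketch: strip each direct text node and drop the ones that are only whitespace. It uses string=True, the newer spelling of text=True, and html.parser; both choices are assumptions about the installed bs4 version and setup.

```python
from bs4 import BeautifulSoup

HTML = """<div class="filmo-row even" id="actor-tt14677742">
<span class="year_column">2021</span>
<b><a href="/title/tt14677742/">Welcome Back Future</a></b>
(Short)
<br/>
Leo
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div")

# Only the text nodes that are direct children of the div (not inside <span> or <b>)
direct_text = div.find_all(string=True, recursive=False)

# Strip surrounding whitespace and discard whitespace-only nodes
cleaned = [t.strip() for t in direct_text if t.strip()]
print(cleaned)  # ['(Short)', 'Leo']
```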
Upvotes: 2