alekscooper

Reputation: 831

Extracting the text which is not in between tags with BeautifulSoup

I'm practicing BeautifulSoup by scraping imdb.com and for a given actor I would like to

  1. get the list of all films they starred in as an actor;
  2. filter out all films that are not full-length features, i.e. TV series, short films, short documentaries, etc.

So far for all films I can get something like the following soup:

<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>

As we can see, this film should be filtered out because it's a short. We can also see that the (Short) label is not wrapped in any tag.
My question:
How can I get this information from the soup? How can I look for text after </b>, if there is any at all?

Upvotes: 2

Views: 1078

Answers (3)

alekscooper

Reputation: 831

I don't know much about bs4, but I tried next_sibling and that solved my problem.

So I do this:

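# This runs inside the function that processes a single film row.
# The text node right after the first <b> holds the category, e.g. " (Short) ".
category = movie_soup.find_all('b')[0].next_sibling
if any(label in category for label in ('TV', 'Short', 'Series', 'Video', 'Documentary')):
    return None, None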

If the film falls into one of the categories I don't need, I return None, None. I know it's not the best code style-wise, but it works for me.
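
For reference, here is a minimal sketch of the same idea as a standalone function. The names film_category and SKIPPED are made up for illustration, and it uses the built-in html.parser so it runs without lxml. It assumes the category, when present, is the bare text right after the first <b>, and it guards against next_sibling being None or another tag.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical tuple of labels to skip, taken from the check above.
SKIPPED = ("TV", "Short", "Series", "Video", "Documentary")

def film_category(row):
    """Return the bare text right after the first <b> in a row, or ''."""
    bold = row.find("b")
    if bold is None:
        return ""
    sibling = bold.next_sibling
    # next_sibling can be None or another tag, so only accept plain text.
    if isinstance(sibling, NavigableString):
        return sibling.strip()
    return ""

html = """<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>"""

row = BeautifulSoup(html, "html.parser").find("div", class_="filmo-row")
category = film_category(row)
is_feature = not any(label in category for label in SKIPPED)
print(category, is_feature)  # (Short) False
```

For a row with no label, the text after </b> is only whitespace, so film_category returns an empty string and the row counts as a feature.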

Upvotes: 0

I'm not sure exactly what you are looking for, but based on the comments and the other answer, the code below should achieve your goal.

from bs4 import BeautifulSoup


html = '''<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>'''


soup = BeautifulSoup(html, 'lxml')
print(list(soup.select_one('.filmo-row').stripped_strings))

Output:

['2021', 'Welcome Back Future', '(Short)', 'Leo']
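
Building on that list, one possible way to pick out the category is to keep only the strings wrapped in parentheses. This is a sketch under the assumption that labels always look like "(Short)"; a title or role that is itself written in parentheses would also match. The helper name parenthesised_labels is hypothetical, and html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

def parenthesised_labels(row):
    """Collect the stripped strings in a row that look like '(...)'."""
    return [
        s for s in row.stripped_strings
        if s.startswith("(") and s.endswith(")")
    ]

html = '''<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>'''

soup = BeautifulSoup(html, "html.parser")
labels = parenthesised_labels(soup.select_one(".filmo-row"))
print(labels)  # ['(Short)']
```

An empty list would then mean the row carries no label and can be kept as a feature film.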

Upvotes: 2

imxitiz

Reputation: 3987

You can use this:

from bs4 import BeautifulSoup as bs

HTML="""<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>
"""

soup=bs(HTML,"lxml")

print(soup.find("div").find_all(text=True,recursive=False))
# ['\n', '\n', '\n     (Short)\n    ', '\n     Leo\n']

# If you use html5lib as the parser, the result is a bit different:
soup=bs(HTML,"html5lib")
print(soup.find("div").find_all(text=True,recursive=False))
# ['\n    ', '\n    ', '\n     (Short)\n    ', '\n     Leo\n']

# If you want all of the text from div then try this:
print(soup.find("div").find_all(text=True,recursive=True))
# ['\n', '2021', '\n', 'Welcome Back Future', '\n     (Short)\n    ', '\n     Leo\n']
# Or simply use
print(soup.find("div").text)
"""
2021
Welcome Back Future
     (Short)

     Leo

"""

From here you can clean up the whitespace yourself. I also assume that "get the list of all films they starred in as an actor" means you need the role name (Leo) as well.
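
As a rough sketch of that cleanup step: strip each direct text node and drop the ones that are only whitespace. It uses string=True, the newer spelling of text=True in bs4, and html.parser instead of lxml; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

HTML = """<div class="filmo-row even" id="actor-tt14677742">
    <span class="year_column">2021</span>
    <b><a href="/title/tt14677742/">Welcome Back Future</a></b>
     (Short)
    <br/>
     Leo
</div>
"""

div = BeautifulSoup(HTML, "html.parser").find("div")

# recursive=False keeps only text that is a direct child of the div,
# i.e. the text that is not wrapped in any tag.
direct_text = [
    s.strip()
    for s in div.find_all(string=True, recursive=False)
    if s.strip()
]
print(direct_text)  # ['(Short)', 'Leo']
```

The year and the title are skipped here because they sit inside <span> and <b>; only the untagged label and role remain.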

Upvotes: 2
