Mike D

Reputation: 23

Extracting Character Roles from Tom Holland's IMDB Page using BeautifulSoup

I extracted the following data from Tom Holland's IMDB page and defined it as "movie_contents":

[<div class="filmo-row odd" id="actor-tt10872600">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt10872600/">Untitled Spider-Man Sequel</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt10872600?rf=cons_nm_filmo">announced</a>)
 <br/>
 Peter Parker / Spider-Man
 </div>, <div class="filmo-row even" id="actor-tt1464335">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt1464335/">Uncharted</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>)
 <br/>
 Nathan Drake
 </div>, <div class="filmo-row odd" id="actor-tt2076822">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt2076822/">Chaos Walking</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt2076822?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Todd Hewitt
 </div>, <div class="filmo-row even" id="actor-tt9130508">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt9130508/">Cherry</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt9130508?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Nico Walker
 </div>, <div class="filmo-row odd" id="actor-tt7395114">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt7395114/">The Devil All the Time</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt7395114?rf=cons_nm_filmo">completed</a>)
 <br/>
 Arvin Russell
 </div>, <div class="filmo-row even" id="actor-tt7146812">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt7146812/">Onward</a></b>
 <br/>
 Ian Lightfoot (voice)
 </div>, <div class="filmo-row odd" id="actor-tt6673612">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt6673612/">Dolittle</a></b>
 <br/>
 Jip (voice)
 </div>

I'm having issues. How can I extract all the character role names, such as "Peter Parker / Spider-Man", "Nathan Drake", "Todd Hewitt", etc.?

Upvotes: 2

Views: 338

Answers (2)

Andrej Kesely

Reputation: 195573

This script will print all roles for the actor:

import requests
from bs4 import BeautifulSoup


url = 'https://www.imdb.com/name/nm4043618/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

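# Each role is the text node right after the <br/> inside a filmography row.
seen = set()
for row in soup.select('#filmo-head-actor + div .filmo-row > br'):
    role = row.find_next(string=True).strip()
    # Print each distinct role only once.
    if role not in seen:
        seen.add(role)
        print(role)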

Prints:

Peter Parker / Spider-Man
Nathan Drake
Todd Hewitt
Nico Walker
Arvin Russell
Ian Lightfoot (voice)
Jip (voice)
Walter (voice)
Samuel Insull
Brother Diarmuid - The Novice
Jack Fawcett
Bradley Baker
Thomas Nickerson
Tom
Gregory Cromwell
Former Billy (Encore) (uncredited)
Isaac
Eddie (voice)
Boy
Lucas
Shô (UK version, voice)
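
The selector matches each `<br/>` that is a direct child of a `.filmo-row`, and `find_next(string=True)` returns the text node that follows it. The live page layout may have changed since this was written, so here is a minimal offline sketch of the same idea. It assumes the markup shown in the question and uses two rows copied from it instead of a network request; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Two rows copied from the question's markup (assumption: the page still looks like this).
html = """
<div class="filmo-row odd" id="actor-tt10872600">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt10872600/">Untitled Spider-Man Sequel</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt10872600?rf=cons_nm_filmo">announced</a>)
 <br/>
 Peter Parker / Spider-Man
</div>
<div class="filmo-row even" id="actor-tt7146812">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt7146812/">Onward</a></b>
 <br/>
 Ian Lightfoot (voice)
</div>
"""

soup = BeautifulSoup(html, "html.parser")

roles = []
for br in soup.select(".filmo-row > br"):
    # The role is the first text node after the <br/> tag.
    roles.append(br.find_next(string=True).strip())

print(roles)
```

Because the `<br/>` is the anchor, the optional "(announced)" or "(filming)" link before it does not affect the result.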

EDIT: To get the roles into a DataFrame, you can do this:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://www.imdb.com/name/nm4043618/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Collect each distinct role once, in page order.
seen = set()
all_data = []
for row in soup.select("#filmo-head-actor + div .filmo-row > br"):
    role = row.find_next(string=True).strip()
    if role not in seen:
        seen.add(role)
        all_data.append(role)

df = pd.DataFrame(all_data, columns=["Role"])
print(df)

Prints:

                                  Role
0            Peter Parker / Spider-Man
1                         Nathan Drake
2                          Todd Hewitt
3                          Nico Walker
4                        Arvin Russell
5                Ian Lightfoot (voice)
6                          Jip (voice)
7                       Walter (voice)
8                        Samuel Insull
9        Brother Diarmuid - The Novice
10                        Jack Fawcett
11                       Bradley Baker
12                    Thomas Nickerson
13                                 Tom
14                    Gregory Cromwell
15  Former Billy (Encore) (uncredited)
16                               Isaac
17                       Eddie (voice)
18                                 Boy
19                               Lucas
20             Shô (UK version, voice)
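
Note that the `seen` set drops repeated role names, so a character played in several films appears only once. If one entry per film is wanted instead, the title can be kept next to the role. This is a minimal sketch under the assumption that every row has the `<b><a>` title link and the `<br/>` shown in the question; the two sample rows and the name `pairs` are illustrative only.

```python
from bs4 import BeautifulSoup

# Two rows copied from the question's markup, used instead of a live request.
html = """
<div class="filmo-row even" id="actor-tt1464335">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt1464335/">Uncharted</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>)
 <br/>
 Nathan Drake
</div>
<div class="filmo-row odd" id="actor-tt6673612">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt6673612/">Dolittle</a></b>
 <br/>
 Jip (voice)
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for row in soup.select("div.filmo-row"):
    # Title is the link text inside <b>; role is the text right after <br/>.
    title = row.select_one("b > a").get_text(strip=True)
    role = row.find("br").find_next(string=True).strip()
    pairs.append((title, role))

print(pairs)
```

The resulting list of tuples can be passed to `pd.DataFrame(pairs, columns=["Title", "Role"])` if a two-column DataFrame is preferred.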

Upvotes: 0

UWTD TV

Reputation: 910

Try:

from bs4 import BeautifulSoup

html = '''<html>
 <div class="filmo-row odd" id="actor-tt10872600">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt10872600/">Untitled Spider-Man Sequel</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt10872600?rf=cons_nm_filmo">announced</a>)
 <br/>
 Peter Parker / Spider-Man
 </div>, <div class="filmo-row even" id="actor-tt1464335">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt1464335/">Uncharted</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>)
 <br/>
 Nathan Drake
 </div>, <div class="filmo-row odd" id="actor-tt2076822">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt2076822/">Chaos Walking</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt2076822?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Todd Hewitt
 </div>, <div class="filmo-row even" id="actor-tt9130508">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt9130508/">Cherry</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt9130508?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Nico Walker
 </div>, <div class="filmo-row odd" id="actor-tt7395114">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt7395114/">The Devil All the Time</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt7395114?rf=cons_nm_filmo">completed</a>)
 <br/>
 Arvin Russell
 </div>, <div class="filmo-row even" id="actor-tt7146812">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt7146812/">Onward</a></b>
 <br/>
 Ian Lightfoot (voice)
 </div>, <div class="filmo-row odd" id="actor-tt6673612">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt6673612/">Dolittle</a></b>
 <br/>
 Jip (voice)
 </div>
 '''
soup = BeautifulSoup(html, 'html.parser')


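# Select every filmography row (both "odd" and "even"), not only the odd ones.
divs = soup.select('div.filmo-row')
for div in divs:
    # Text nodes that are direct children of the row; only the role is longer than 3 characters.
    text = div.find_all(string=True, recursive=False)
    print(*[t.strip() for t in text if len(t) > 3])

Prints:

Peter Parker / Spider-Man
Nathan Drake
Todd Hewitt
Nico Walker
Arvin Russell
Ian Lightfoot (voice)
Jip (voice)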
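
The `len(t) > 3` filter is applied before stripping, so it depends on the exact whitespace around the parentheses in the markup. A slightly more defensive variant, sketched here under the assumption that the role is always the last direct text node of the row, strips first and then discards empty strings and stray parentheses. The helper name `role_of` is hypothetical.

```python
from bs4 import BeautifulSoup

# Two rows copied from the question's markup.
html = """
<div class="filmo-row odd" id="actor-tt2076822">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt2076822/">Chaos Walking</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt2076822?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Todd Hewitt
</div>
<div class="filmo-row even" id="actor-tt7146812">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt7146812/">Onward</a></b>
 <br/>
 Ian Lightfoot (voice)
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def role_of(row):
    """Return the last meaningful direct text node of a row, or None."""
    parts = [t.strip() for t in row.find_all(string=True, recursive=False)]
    # Drop whitespace-only nodes and the bare parentheses around the status link.
    parts = [p for p in parts if p and p not in {"(", ")"}]
    return parts[-1] if parts else None


roles = [role_of(row) for row in soup.select("div.filmo-row")]
print(roles)
```

Rows without any role text yield `None` rather than an empty line, which makes them easy to filter out afterwards.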

Upvotes: 0
