lela_rib

Reputation: 147

Python dict from HTML table

I'm trying to use Beautiful Soup to convert an HTML table into a Python dict. But since the table has multiple levels, I'm not able to save the information properly.

Here's what I've tried:

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt8579674/awards'
response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')

award_list = []
award_dict = {}

for table in html_soup.find_all('table', {'class': 'awards'}):
    for tr in table.find_all('tr'):
        for title_award_outcome in tr.find_all('td', {'class': 'title_award_outcome'}):
            award_name = title_award_outcome.get_text(separator='<br/>', 
                                                      strip=True).split('<br/>', 1)[1]            

        for award_description in tr.find_all('td', {'class': 'award_description'}):
            award_description = award_description.get_text(separator='<br/>', 
                                                           strip=True).split('<br/>', 1)[0]
            award = award_name+'_'+award_description

        for title_award_outcome in tr.find_all('td', {'class': 'title_award_outcome'}):
            result = title_award_outcome.get_text(separator='<br/>', strip=True).split('<br/>', 1)[0]

            award_dict[award] = result
            award_list.append(award_dict)

print(award_list)

This returns only the first entry of the second column.

Expected result:

[{'Golden Globe_Best Motion Picture - Drama': 'Winner', 
  'Golden Globe_Best Original Score - Motion Picture': 'Nominee', 
  'Golden Globe_Best Original Score - Motion Picture': 'Nominee', 
  'BAFTA Film Award_Best Director': 'Nominee',
  'BAFTA Film Award_Outstanding British Film of the Year': 'Nominee',
   etc, etc, etc}]

Upvotes: 0

Views: 198

Answers (1)

Andrej Kesely

Reputation: 195418

To create the desired dictionary, you can use this example:

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt8579674/awards'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

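out = {}
current = None  # descriptions collected for the most recent header cell
for td in soup.select('.awards td'):
    outcome, cat = td.select_one('.title_award_outcome b'), td.select_one('.award_category')
    if outcome and cat:
        # header cell: start a new group keyed by (outcome, award name)
        current = []
        out[(outcome.get_text(strip=True), cat.get_text(strip=True))] = current
    elif current is not None:
        # description cell: drop the links, keep the leading text
        for a in td.select('a'):
            a.extract()
        current.append(td.contents[0].strip())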

# transform the dict to desired structure:
out2 = {}
for (outcome, award), v in out.items():
    for i in v:
        out2['{}_{}'.format(award, i)] = outcome

# print it
from pprint import pprint
pprint(out2)

Prints:

{'AACTA International Award_Best Direction': 'Nominee',
 'AFCA Award_Best Cinematography': 'Winner',
 'AFCA Award_Best Film Editing': 'Nominee',
 'AFCA Award_Best Score': 'Winner',
 'AFCC Award_Best Cinematography': 'Winner',
 'AFCC Award_Best Original Score': 'Winner',
 'AFCC Award_Top Ten Films': 'Nominee',
 'AFI Award_Movie of the Year': 'Winner',
 'ALFS Award_British/Irish Actor of the Year': 'Nominee',
 'ALFS Award_British/Irish Film of the Year': 'Nominee',
 'ALFS Award_Director of the Year': 'Nominee',
 'ALFS Award_Film of the Year': 'Nominee',

...and so on.
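For reference, here is a minimal offline sketch of the same idea that walks the rows instead of the cells. It assumes the header cell (`title_award_outcome`) spans several rows via `rowspan`, so only the first row of each group contains it and its values have to be carried forward to the following rows. The `SAMPLE_HTML` markup and the `parse_awards` helper are hypothetical, written only to mirror the class names used above; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the assumed layout: the header cell
# spans several rows, so later rows only carry an award_description cell.
SAMPLE_HTML = """
<table class="awards">
  <tr>
    <td class="title_award_outcome" rowspan="1"><b>Winner</b><br/>
        <span class="award_category">Golden Globe</span></td>
    <td class="award_description">Best Motion Picture - Drama</td>
  </tr>
  <tr>
    <td class="title_award_outcome" rowspan="1"><b>Nominee</b><br/>
        <span class="award_category">Golden Globe</span></td>
    <td class="award_description">Best Original Score - Motion Picture<br/>
        <a href="#">Some Composer</a></td>
  </tr>
  <tr>
    <td class="title_award_outcome" rowspan="2"><b>Nominee</b><br/>
        <span class="award_category">BAFTA Film Award</span></td>
    <td class="award_description">Best Director</td>
  </tr>
  <tr>
    <td class="award_description">Outstanding British Film of the Year</td>
  </tr>
</table>
"""


def parse_awards(html):
    """Return {'<award>_<description>': '<outcome>'} for every table row."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    outcome = award = None  # carried forward across rows of one group
    for tr in soup.select("table.awards tr"):
        head = tr.find("td", class_="title_award_outcome")
        if head is not None:
            # First row of a group: remember the outcome and the award name.
            outcome = head.find("b").get_text(strip=True)
            award = head.find(class_="award_category").get_text(strip=True)
        desc = tr.find("td", class_="award_description")
        if desc is None or award is None:
            continue
        # The first text node is the description; names and links follow it.
        description = next(desc.stripped_strings, "")
        result[f"{award}_{description}"] = outcome
    return result


awards = parse_awards(SAMPLE_HTML)
print(awards)
```

Note that because the result is a plain dict, two rows that produce the same key would overwrite each other.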

Upvotes: 1
