Kyle

Reputation: 387

python beautifulsoup dictionary table with list

I am trying to build a dictionary whose keys can be used for a later join and whose values are lists. The code below prints each value on its own line; its actual output and the output I want are shown after it. How can I get the desired output as a dictionary of lists? Note that the second row has no link. When that happens, can a placeholder such as None be put in its place?

import requests
from bs4 import BeautifulSoup
from collections import defaultdict

html='<tr><td align="right">1</td><td align="left"><a href="http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka">Victoria Azarenka</a></td><td align="left">BLR</td><td align="left">1989-07-31</td></tr> <tr><td align="right">1146</td><td align="left">Brittany Lashway</td><td align="left">USA</td><td align="left">1994-04-06</td></tr>'

soup = BeautifulSoup(html,'lxml')

for cell in soup.find_all('td'):
    if cell.find('a', href=True):
        print(cell.find('a', href=True).attrs['href'])
        print(cell.find('a', href=True).text)
    else:
        print(cell.text)

'''
Output From Code:
1 --> Rank
http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka --> Website
Victoria Azarenka --> Name
BLR --> Country
1989-07-31 --> Birth Date
1146 --> Rank
Brittany Lashway --> Name
USA --> Country
1994-04-06 --> Birth Date

Desired Output: (Dictionary Table with List component)

{Key, [Rank, Website,Name, Country, Birth Date]}
Example:
{1, [1, http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka, Victoria Azarenka, BLR, 1989-07-31]}
{2, [1146, None, Brittany Lashway, USA, 1994-04-06]}
'''

Upvotes: 1

Views: 394

Answers (1)

Chiheb Nexus

Reputation: 9257

You can do something like this using a generator together with list and dict comprehensions:

from bs4 import BeautifulSoup as bs

html='<tr><td align="right">1</td><td align="left"><a href="http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka">Victoria Azarenka</a></td><td align="left">BLR</td><td align="left">1989-07-31</td></tr> <tr><td align="right">1146</td><td align="left">Brittany Lashway</td><td align="left">USA</td><td align="left">1994-04-06</td></tr>'

# Generator that yields the link and its text for a linked cell,
# or just the text for a cell without a link
def find_link_or_text(cells):
    for cell in cells:
        link = cell.find('a', href=True)
        if link:
            yield link.attrs['href']
            yield link.text
        else:
            yield cell.text

# Parse data using BeautifulSoup
data = bs(html, 'lxml')
# Collect every td cell from the parsed document
parsed = data.find_all('td')

# Group the cells in chunks of 4 (one chunk per table row); a linked cell yields two values
sub = [list(find_link_or_text(parsed[k:k+4])) for k in range(0, len(parsed), 4)]

# Key each row's list by its 1-based position, from 1 to len(sub)
final = dict(enumerate(sub, start=1))
print(final)

Output (the second row has no link, so its list has four items instead of five):

{1: ['1', 'http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka', 'Victoria Azarenka', 'BLR', '1989-07-31'], 2: ['1146', 'Brittany Lashway', 'USA', '1994-04-06']}
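If you also want a None placeholder where the link is missing, as asked in the question, one possible variation is to walk the table row by row and always emit a link slot for the name cell. This is only a sketch under a few assumptions that are not in the code above: the name is always the second cell of a row (the hypothetical NAME_COLUMN constant), the helper name row_to_list is made up for illustration, and the built-in 'html.parser' is used instead of 'lxml' so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

html = ('<tr><td align="right">1</td>'
        '<td align="left"><a href="http://www.tennisabstract.com/cgi-bin/wplayer.cgi?p=VictoriaAzarenka">Victoria Azarenka</a></td>'
        '<td align="left">BLR</td><td align="left">1989-07-31</td></tr> '
        '<tr><td align="right">1146</td><td align="left">Brittany Lashway</td>'
        '<td align="left">USA</td><td align="left">1994-04-06</td></tr>')

# Assumption: the player name (with its optional link) is the second cell of each row
NAME_COLUMN = 1

def row_to_list(row):
    """Return [rank, link-or-None, name, country, birth date] for one tr element."""
    values = []
    for index, cell in enumerate(row.find_all('td')):
        if index == NAME_COLUMN:
            # Always reserve a slot for the link, using None when the cell has none
            link = cell.find('a', href=True)
            values.append(link['href'] if link else None)
        values.append(cell.get_text(strip=True))
    return values

soup = BeautifulSoup(html, 'html.parser')

# Key each row's list by its 1-based position in the table
final = {key: row_to_list(row)
         for key, row in enumerate(soup.find_all('tr'), start=1)}
print(final)
```

Because every row now produces five items, the lists line up by position, which should make the later join easier. Iterating over tr elements instead of slicing the flat td list also avoids hard-coding the number of cells per row.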

Upvotes: 1
