Zack

Reputation: 45

Trouble returning web scraping output as dictionary

I am trying to scrape a staff roster from a website, and I want the end result to be a dictionary in the format {staff: position}. Currently the script returns every staff name and position as a separate string: it prints the whole list of names first, then the whole list of positions. The first name on the list should be paired with the first position, and so on.

I have determined that each name and position is a bs4.element.Tag. I believe I need to build one list of names and one list of positions, then use zip to combine them into a dictionary. I have tried implementing this, but nothing has worked so far. The closest I could get to the text I need with the class_ parameter was the individual div that contains the p.

I am still inexperienced with Python and new to web scraping, but I am relatively well versed in HTML and CSS, so any help would be greatly appreciated.

# Simple script attempting to scrape 
# the staff roster off of the 
# Greenville Drive website

import requests
from bs4 import BeautifulSoup

URL = 'https://www.milb.com/greenville/ballpark/frontoffice'

page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

staff = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3')

# Use a separate loop variable so the 'staff' list is not overwritten
for staff_div in staff:
    data = staff_div.find('p')
    if data:
        print(data.text.strip())

position = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6')

# Use a separate loop variable so the 'position' list is not overwritten
for position_div in position:
    data = position_div.find('p')
    if data:
        print(data.text.strip())

# So far this code prints the needed data, but I need it in a dict

Upvotes: 3

Views: 267

Answers (2)

Andrej Kesely

Reputation: 195408

Solution using CSS selectors and zip():

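import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://www.milb.com/greenville/ballpark/frontoffice'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# First selector: <b> tags (names) inside a div that is immediately
# followed by a sibling div containing a <p>.
# Second selector: <p> tags (positions) inside the div that immediately
# follows a div whose child div contains a <b>.
out = {}
for name, position in zip(soup.select('div:has(+ div p) b'),
                          soup.select('div:has(> div b) + div p')):
    out[name.text] = position.text

pprint(out)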

Prints:

{'Adam Baird': 'Accountant',
 'Alex Guest': 'Director of Game Entertainment & Production',
 'Allison Roedell': 'Office Manager',
 'Amanda Medlin': 'Business and Team Operations Manager',
 'Beth Rusch': 'Director of West End Events',
 'Brady Andrews': 'Assistant Director of Facility Operations',
 'Brooks Henderson': 'Merchandise Manager',
 'Bryan Jones': 'Facilities Cleanliness Manager',
 'Cameron White': 'Media Relations Manager',
 'Craig Brown': 'Owner/Team President',
 'Davis Simpson': 'Director of Media and Creative Services',
 'Ed Jenson': 'Broadcaster',
 'Elise Parish': 'Premium Services Manager',
 'Eric Jarinko': 'General Manager',
 'Grant Witham': 'Events Manager',
 'Greg Burgess, CSFM': 'Vice President of Operations/Grounds',
 'Houghton Flanagan': 'Account Executive',
 'Jeb Maloney': 'Account Executive',
 'Jeff Brown': 'Vice President of Marketing',
 'Jenny Burgdorfer': 'Director of Merchandise',
 'Jordan Smith ': 'Vice President of Finance',
 'Katie Batista': 'Director of Sponsorships and Community Engagement',
 'Kristin Kipper': 'Events Manager',
 'Lance Fowler': 'Director of Video Production',
 'Matthew Tezza': 'Sponsor Services and Activations Manager',
 'Melissa Welch': 'Sponsorship and Community Events Manager',
 'Micah Gold': 'Senior Account Executive',
 'Mike Agostino': 'Director of Food and Beverage',
 'Molly Mains': 'Senior Account Executive',
 'Nate Lipscomb': 'Special Advisor to the President',
 'Ned Kennedy': 'Director of Inside Sales',
 'Olivia Adams': 'Inside Sales Representative',
 'Patrick Innes': 'Director of Ticket Operations',
 'Phil Bargardi': 'Vice President of Sales',
 'Roger Campana': 'Assistant Director of Food and Beverage',
 'Steve Seman': 'Merchandise / Ticketing Advisor',
 'Timmy Hinds': 'Director of Facility Operations',
 'Toby Sandblom': 'Inside Sales Representative',
 'Tyler Melson': 'Inside Sales Representative',
 'Wilbert Sauceda': 'Executive Chef',
 'Zack Pagans': 'Assistant Groundskeeper'}
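
The pairing step can also be tried without touching the website. The following is only a minimal sketch, assuming the names and positions have already been collected into two plain lists. `pair_staff` is a hypothetical helper, not part of BeautifulSoup, and the three sample entries are copied from the output above. It strips stray whitespace (note the trailing space in 'Jordan Smith ') and refuses to pair lists of different lengths, since zip() would otherwise silently drop the extra items.

```python
def pair_staff(names, positions):
    """Pair names with positions by index (hypothetical helper).

    Raises ValueError if the lists differ in length, because zip()
    would otherwise silently truncate to the shorter one.
    """
    if len(names) != len(positions):
        raise ValueError("names and positions differ in length")
    # Strip whitespace so keys like 'Jordan Smith ' become 'Jordan Smith'
    return {n.strip(): p.strip() for n, p in zip(names, positions)}


# Sample values taken from the output above
names = ['Craig Brown', 'Eric Jarinko', 'Jordan Smith ']
positions = ['Owner/Team President', 'General Manager',
             'Vice President of Finance']

roster = pair_staff(names, positions)
print(roster)
```

Keep in mind that a dictionary holds one value per key, so if two staff members shared the same name, the later entry would overwrite the earlier one.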

Upvotes: 1

Bitto

Reputation: 8205

BeautifulSoup has find_next(), which returns the next tag in the document that matches the given filters. Find each "staff" div, then use find_next() to get the adjacent "position" div.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.milb.com/greenville/ballpark/frontoffice'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
staff_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3'
position_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6'
result = {}

for staff in soup.find_all('div', class_=staff_class):
    data = staff.find('p')
    if data:
        staff_name = data.text.strip()
        position_div = staff.find_next('div', class_=position_class)
        position_name = position_div.text.strip()
        result[staff_name] = position_name

print(result)

Output

{'Craig Brown': 'Owner/Team President', 'Eric Jarinko': 'General Manager', 'Nate Lipscomb': 'Special Advisor to the President', 'Phil Bargardi': 'Vice President of Sales', 'Jeff Brown': 'Vice President of Marketing', 'Greg Burgess, CSFM': 'Vice President of Operations/Grounds', 'Jordan Smith': 'Vice President of Finance', 'Ned Kennedy': 'Director of Inside Sales', 'Patrick Innes': 'Director of Ticket Operations', 'Micah Gold': 'Senior Account Executive', 'Molly Mains': 'Senior Account Executive', 'Houghton Flanagan': 'Account Executive', 'Jeb Maloney': 'Account Executive', 'Olivia Adams': 'Inside Sales Representative', 'Tyler Melson': 'Inside Sales Representative', 'Toby Sandblom': 'Inside Sales Representative', 'Katie Batista': 'Director of Sponsorships and Community Engagement', 'Matthew Tezza': 'Sponsor Services and Activations Manager', 'Melissa Welch': 'Sponsorship and Community Events Manager', 'Beth Rusch': 'Director of West End Events', 'Kristin Kipper': 'Events Manager', 'Grant Witham': 'Events Manager', 'Alex Guest': 'Director of Game Entertainment & Production', 'Lance Fowler': 'Director of Video Production', 'Davis Simpson': 'Director of Media and Creative Services', 'Cameron White': 'Media Relations Manager', 'Ed Jenson': 'Broadcaster', 'Adam Baird': 'Accountant', 'Mike Agostino': 'Director of Food and Beverage', 'Roger Campana': 'Assistant Director of Food and Beverage', 'Wilbert Sauceda': 'Executive Chef', 'Elise Parish': 'Premium Services Manager', 'Timmy Hinds': 'Director of Facility Operations', 'Zack Pagans': 'Assistant Groundskeeper', 'Amanda Medlin': 'Business and Team Operations Manager', 'Allison Roedell': 'Office Manager'}
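
To see how find_next() pairs each name with its position without depending on the live page, here is a small offline sketch. The HTML snippet and the class names `name` and `role` are simplified assumptions made for illustration, not the real markup of the site. It also skips a name div when no matching position div follows it, which the code above does not check for.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup; the real page uses long l-grid__col classes
HTML = """
<div class="row">
  <div class="name"><p><b>Craig Brown</b></p></div>
  <div class="role"><p>Owner/Team President</p></div>
</div>
<div class="row">
  <div class="name"><p><b>Eric Jarinko</b></p></div>
  <div class="role"><p>General Manager</p></div>
</div>
<div class="row">
  <div class="name"></div>
  <div class="role"></div>
</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')
result = {}

for name_div in soup.find_all('div', class_='name'):
    name_p = name_div.find('p')
    if name_p is None:
        continue  # empty grid cell, nothing to pair
    # find_next() searches forward in document order from name_div
    role_div = name_div.find_next('div', class_='role')
    if role_div is None:
        continue  # no position div follows this name
    result[name_p.get_text(strip=True)] = role_div.get_text(strip=True)

print(result)
```

Because each name is looked up together with the div that follows it, this approach does not rely on the two lists having the same length, unlike pairing by index with zip().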

Upvotes: 3
