Reputation: 137
I've written a Python script using the requests module and the BeautifulSoup library to get the names of the people listed under the title Browse Our Offices on a website. The problem is that when I run the script it returns seemingly random names: the ones populated automatically when no tab is selected.
When visiting that page you can see tabs like those in the image below:
I would like to make the selection shown in the image below. To be clearer: I want to select the United States tab and then each of the state tabs to parse the names connected to them. That's it.
I've tried with:
import requests
from bs4 import BeautifulSoup
link = "https://www.schooleymitchell.com/offices/"
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("#consultant_info > strong"):
    print(item.text)
The above script produces random names, but I want the names connected to the United States tab.
How can I get all the names populated upon selecting the United States tab and its different state tabs without using selenium?
Upvotes: 0
Views: 47
Reputation: 11101
First scrape all the people, then filter them using their ids, which are formatted like {city}-{state}-{country}. One problem is that spaces in multi-word state/city names are replaced with dashes (-), so the id cannot simply be split on dashes. We can handle that by building a lookup table from the state list in the left sidebar.
Here's how:
import requests
from bs4 import BeautifulSoup


def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'
    })
    res.raise_for_status()
    return BeautifulSoup(res.text, 'html.parser')


def extract_people(soup: BeautifulSoup) -> list:
    people = []
    # Lookup table built from the sidebar: state element id -> display name.
    state_ids = {s['id']: s.text.strip()
                 for s in soup.select('#state-usa .state')}
    for person in soup.select('#office_box .office'):
        person_id = person['id']
        # The country is the last dash-separated part of the id.
        location, country = person_id.rsplit('-', 1)
        if country != 'usa':
            continue
        state, city = None, None
        for k in state_ids.keys():
            if k in location:
                state = state_ids[k]
                # Whatever remains after removing the state is the city.
                city = location.replace(k, '').replace('-', ' ').strip()
                break
        name = person.select_one('#consultant_info > strong').text.strip()
        contact_url = person.select_one('.contact-button')['href']
        p = {
            'name': name,
            'state': state,
            'city': city,
            'contact_url': contact_url,
        }
        people.append(p)
    return people


if __name__ == "__main__":
    url = 'https://www.schooleymitchell.com/offices/'
    soup = make_soup(url)
    people = extract_people(soup)
    print(people)
Output:
[
{'name': 'Steven Bremer', 'state': 'Alabama', 'city': 'Gadsden', 'contact_url': 'https://www.schooleymitchell.com/sbremer/contact'},
{'name': 'David George', 'state': 'Alabama', 'city': 'Montgomery', 'contact_url': 'https://www.schooleymitchell.com/dgeorge/contact'},
{'name': 'Zachary G. Madrigal, MBA', 'state': 'Arizona', 'city': 'Phoenix', 'contact_url': 'https://www.schooleymitchell.com/zmadrigal/contact'},
...
]
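The id-splitting step can also be tried offline. The following is a minimal sketch, assuming the ids really follow the {city}-{state}-{country} pattern described above; the function name split_office_id, the sample ids, and the SAMPLE_STATES table are made up for illustration and are not taken from the site. Unlike the substring check in the answer, it matches the state as a suffix of the location and tries longer state ids first, so that an id like virginia does not shadow west-virginia.

```python
def split_office_id(office_id, state_ids):
    """Split '{city}-{state}-{country}' into (city, state, country).

    state_ids maps a dash-separated state id to its display name.
    State and city are None when no known state id matches.
    """
    location, country = office_id.rsplit('-', 1)
    # Try longer ids first so 'west-virginia' wins over 'virginia'.
    for slug in sorted(state_ids, key=len, reverse=True):
        if location == slug or location.endswith('-' + slug):
            city = location[:-len(slug)].strip('-').replace('-', ' ')
            return city, state_ids[slug], country
    return None, None, country


# Hypothetical sample data, not scraped from the real page.
SAMPLE_STATES = {
    'virginia': 'Virginia',
    'west-virginia': 'West Virginia',
    'new-york': 'New York',
}

if __name__ == "__main__":
    for office_id in ('charleston-west-virginia-usa',
                      'new-york-new-york-usa',
                      'toronto-ontario-canada'):
        print(split_office_id(office_id, SAMPLE_STATES))
```

If the real ids use a different layout, only the suffix check would need to change; the rest of extract_people could call this helper in place of its inner loop.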
Upvotes: 1
Reputation: 195528
The important data is in the <div> tag with id="office_box". You are only interested in consultants inside a <div> whose id ends with -usa. The first column contains the name, the second the city and state:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://www.schooleymitchell.com/offices/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for div in soup.select('#office_box div[id*="-usa"] div.consultant_info_container'):
    for a in div.select('a'):
        a.extract()
    info = div.get_text(separator=" ").strip()
    info = re.split(r'\s{2}', info)
    for data in info:
        print('{: ^45}'.format(data), end='|')
    print()
Prints:
Steven Bremer | Gadsden, Alabama | Voice: 256-328-2485 |
David George | Montgomery, Alabama | Voice: 334-649-7535 |
Zachary G. Madrigal, MBA | Phoenix, Arizona | Voice: 602-677-7804 |
Richard E. Perraut Jr. | Phoenix, Arizona | Voice: 480-659-3831 |
Stephen Moore B.A. | Scottsdale, Arizona | Voice: 480-354-3423 Toll-Free: 866-213-5141 |
Danny Caballes | Tempe, Arizona | Voice: 480-592-0776 |
Brian Lutz | Tucson, Arizona | Voice: 520-447-7921 Toll-Free: 888-633-1451 |
Travis McElroy | Bakersfield, California | Voice: 800-361-4578 |
Matt Denburg | Orange County, California | Affiliated Office | Bottomline Consulting Group, Inc.| Voice: 714-482-6025 |
Pete Craigmile | San Diego, California | Voice: |
Greg Lowry | San Francisco, California | Affiliated Office | DBA Lowry Telecom Consultant| Voice: 415-692-0708 Ext 1 |
Dave Tankersley | Colorado Springs, Colorado | Voice: 719-266-1098 |
Sanjay Tyagi | Denver, Colorado | Voice: 303-317-3110 |
Richard Ray | Highlands Ranch, Colorado | Voice: 303-306-8568 |
Richard Norlin | Highlands Ranch, Colorado | Voice: 612-309-5451 |
Dave Dellacato | Bridgeport, Connecticut | Voice: 203-442-1311 |
Patrick Delehanty | Brookfield, Connecticut | Voice: 475-289-2325 |
Greg Wisz | Fairfield County, Connecticut | Voice: 616-884-0058 |
Jack McCullough | Fairfield County, Connecticut | Voice: 203-767-5551 |
Matthew McCarthy | Hartford, Connecticut | Voice: 203-304-9886 |
Paul Nelson BS CHE, MBA | Hartford, Connecticut | Voice: 860-926-4260 |
...and so on.
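The column split above depends on how many whitespace characters separate the fields once get_text(separator=" ") has flattened the markup. As a small offline sketch (the helper name split_columns and the spacing in the sample string are assumptions, not taken from the page), splitting on runs of two or more whitespace characters avoids empty or space-padded fields when the gaps are wider than two characters, while single spaces inside a field are left alone:

```python
import re


def split_columns(text):
    """Split text into fields separated by runs of 2+ whitespace characters."""
    # Drop empty strings in case the input is blank.
    return [field for field in re.split(r'\s{2,}', text.strip()) if field]


# Invented spacing; gaps of three and four spaces between the fields.
sample = "Steven Bremer   Gadsden, Alabama    Voice: 256-328-2485"

if __name__ == "__main__":
    print(split_columns(sample))
```

With the pattern \s{2} used in the answer, the same sample would leave a leading space on the second field and an empty string before the third, which is why the {2,} quantifier may be the safer choice here.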
Upvotes: 2