robots.txt

Reputation: 137

Can't get selective names connected to a certain tab in a webpage

I've written a script in Python using the requests module and the BeautifulSoup library to get the names of the people listed under the title Browse Our Offices on a website. The problem is that when I run my script, it returns seemingly random names: the ones populated automatically when the page loads, without any tab being selected.

Website Link

When visiting that page, you can see a set of country tabs, each with its own list of states or regions (screenshot omitted).

I would like to make a selection within those tabs (screenshot omitted). To be clearer: I want to select the United States tab and then select each of the states in turn to parse the names connected to them. That's it.

I've tried with:

import requests
from bs4 import BeautifulSoup

link = "https://www.schooleymitchell.com/offices/"

res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("#consultant_info > strong"):
    print(item.text)

The above script produces seemingly random names, but I want to get the names connected to the United States tab.

How can I get all the names populated upon selecting the United States tab and its different state tabs, without using selenium?

Upvotes: 0

Views: 47

Answers (2)

abdusco

Reputation: 11101

First scrape all the people, then filter them using their ids, which are formatted like {city}-{state}-{country}. One problem is that spaces in multi-word state and city names are replaced with dashes (-), so the id cannot simply be split on every dash. We can handle that by building a lookup table from the state list in the left sidebar.

Here's how:

import requests
from bs4 import BeautifulSoup


def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'
    })
    res.raise_for_status()
    return BeautifulSoup(res.text, 'html.parser')


def extract_people(soup: BeautifulSoup) -> list:
    people = []
    state_ids = {s['id']: s.text.strip()
                 for s in soup.select('#state-usa .state')}
    for person in soup.select('#office_box .office'):
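        # Ids look like {city}-{state}-{country}; split off the country first.
        location, country = person['id'].rsplit('-', 1)
        if country != 'usa':
            continue

        state, city = None, None
        # Check longer state ids first so that a multi-word state id
        # is not shadowed by a shorter id contained within it.
        for k in sorted(state_ids, key=len, reverse=True):
            if k in location:
                state = state_ids[k]
                city = location.replace(k, '').replace('-', ' ').strip()
                break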

        name = person.select_one('#consultant_info > strong').text.strip()
        contact_url = person.select_one('.contact-button')['href']
        p = {
            'name': name,
            'state': state,
            'city': city,
            'contact_url': contact_url,
        }
        people.append(p)
    return people


if __name__ == "__main__":
    url = 'https://www.schooleymitchell.com/offices/'
    soup = make_soup(url)
    people = extract_people(soup)

    print(people)

output:

[
    {'name': 'Steven Bremer', 'state': 'Alabama', 'city': 'Gadsden', 'contact_url': 'https://www.schooleymitchell.com/sbremer/contact'}, 
    {'name': 'David George', 'state': 'Alabama', 'city': 'Montgomery', 'contact_url': 'https://www.schooleymitchell.com/dgeorge/contact'}, 
    {'name': 'Zachary G. Madrigal, MBA', 'state': 'Arizona', 'city': 'Phoenix', 'contact_url': 'https://www.schooleymitchell.com/zmadrigal/contact'}, 
    ...
]
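
The id-splitting step can be tried on its own, without any network access. The following is a minimal sketch, assuming the ids really do look like {city}-{state}-{country} and that the state ids in the lookup table use the same dashed form. The function name `split_office_id`, the lookup table, and the sample ids are hypothetical and made up for illustration. It also matches the state as a suffix of the id rather than as a substring, which is a variation on the `in` check used above, not the same logic.

```python
def split_office_id(office_id, state_ids):
    """Split '{city}-{state}-{country}' into (city, state, country).

    state_ids maps dashed state ids (e.g. 'new-york') to display names.
    Returns None if no known state id is a suffix of the location part.
    """
    location, country = office_id.rsplit('-', 1)
    # Try longer state ids first so 'west-virginia' wins over 'virginia'.
    for key in sorted(state_ids, key=len, reverse=True):
        if location == key or location.endswith('-' + key):
            city = location[:-len(key)].strip('-').replace('-', ' ').title()
            return city, state_ids[key], country
    return None


# Hypothetical lookup table and ids, not taken from the live page.
states = {
    'virginia': 'Virginia',
    'west-virginia': 'West Virginia',
    'new-york': 'New York',
}
print(split_office_id('charleston-west-virginia-usa', states))
print(split_office_id('new-york-new-york-usa', states))
```

Suffix matching avoids a city name that happens to contain a state id being mistaken for the state, but it only works if the state is always the last component before the country.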

Upvotes: 1

Andrej Kesely

Reputation: 195528

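The important data is in the <div> tag with id="office_box". You are interested only in the consultants inside a <div> whose id ends with -usa. In the output, the first column contains the name, the second the city and state, and the remaining columns the affiliation (if any) and phone numbers: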

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.schooleymitchell.com/offices/'

soup = BeautifulSoup(requests.get(url).text, 'lxml')


for div in soup.select('#office_box div[id*="-usa"] div.consultant_info_container'):
    for a in div.select('a'):
        a.extract()
    info = div.get_text(separator=" ").strip()
    info = re.split(r'\s{2}', info)
    for data in info:
        print('{: ^45}'.format(data), end='|')
    print()

Prints:

        Steven Bremer                |              Gadsden, Alabama               |             Voice: 256-328-2485             |
        David George                 |             Montgomery, Alabama             |             Voice: 334-649-7535             |
  Zachary G. Madrigal, MBA           |              Phoenix, Arizona               |             Voice: 602-677-7804             |
   Richard E. Perraut Jr.            |              Phoenix, Arizona               |             Voice: 480-659-3831             |
     Stephen Moore B.A.              |             Scottsdale, Arizona             | Voice: 480-354-3423 Toll-Free: 866-213-5141 |
       Danny Caballes                |               Tempe, Arizona                |             Voice: 480-592-0776             |
         Brian Lutz                  |               Tucson, Arizona               | Voice: 520-447-7921 Toll-Free: 888-633-1451 |
       Travis McElroy                |           Bakersfield, California           |             Voice: 800-361-4578             |
        Matt Denburg                 |          Orange County, California          | Affiliated Office | Bottomline Consulting Group, Inc.|             Voice: 714-482-6025             |
       Pete Craigmile                |            San Diego, California            |                   Voice:                    |
         Greg Lowry                  |          San Francisco, California          | Affiliated Office | DBA Lowry Telecom Consultant|          Voice: 415-692-0708 Ext 1          |
       Dave Tankersley               |         Colorado Springs, Colorado          |             Voice: 719-266-1098             |
        Sanjay Tyagi                 |              Denver, Colorado               |             Voice: 303-317-3110             |
         Richard Ray                 |          Highlands Ranch, Colorado          |             Voice: 303-306-8568             |
       Richard Norlin                |          Highlands Ranch, Colorado          |             Voice: 612-309-5451             |
       Dave Dellacato                |           Bridgeport, Connecticut           |             Voice: 203-442-1311             |
      Patrick Delehanty              |           Brookfield, Connecticut           |             Voice: 475-289-2325             |
          Greg Wisz                  |        Fairfield County, Connecticut        |             Voice: 616-884-0058             |
       Jack McCullough               |        Fairfield County, Connecticut        |             Voice: 203-767-5551             |
      Matthew McCarthy               |            Hartford, Connecticut            |             Voice: 203-304-9886             |
   Paul Nelson BS CHE, MBA           |            Hartford, Connecticut            |             Voice: 860-926-4260             |

...and so on.
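
Since the question asks for the names per state, the rows can be grouped once the name and the "City, State" column have been separated. This is a minimal sketch, assuming each row has already been reduced to a (name, place) pair; the helper name `group_names_by_state` is hypothetical, and the sample rows are copied from the printed output above.

```python
from collections import defaultdict


def group_names_by_state(rows):
    """Group consultant names by state.

    rows is an iterable of (name, 'City, State') pairs.
    """
    by_state = defaultdict(list)
    for name, place in rows:
        # The state is whatever follows the last ', ' in the place string.
        city, state = place.rsplit(', ', 1)
        by_state[state].append(name)
    return dict(by_state)


# A few rows copied from the printed output above.
rows = [
    ('Steven Bremer', 'Gadsden, Alabama'),
    ('David George', 'Montgomery, Alabama'),
    ('Danny Caballes', 'Tempe, Arizona'),
]
for state, names in group_names_by_state(rows).items():
    print(state, names)
```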

Upvotes: 2
