Kris L

Reputation: 9

Beautiful Soup with California Senator webpage

I am new to Beautiful Soup and to HTML. After following a tutorial, I am trying to scrape this webpage listing California Senators: https://www.senate.ca.gov/senators My goal is to extract each senator's name, party affiliation, district and capitol office phone number, and ultimately put them into a pandas DataFrame. I looked at the page source and see that h3 is the tag that matters for name/party, and that address/phone is tagged with p. If I find all elements with "h3", I get 201, which is more than the number of senators. I can make the request and parse it into a soup, but I don't know how to drill down to just the information I need. Any help would be appreciated. I have followed a few online tutorials, but they don't cover all cases.

Latest try:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send a GET request to the website
url = "https://www.senate.ca.gov/senators"
response = requests.get(url)

# Use Beautiful Soup to parse the HTML
soup = BeautifulSoup(response.content, "html.parser")

# Find the table that contains the senator information
table = soup.find("table", {"class": "views-table cols-4"})

# Create lists to store the data
names = []
districts = []
parties = []
phones = []

# Extract the senator information from each row in the table
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 4:
        name = cells[0].get_text().strip()
        district = cells[1].get_text().strip()
        party = cells[2].get_text().strip()
        phone = cells[3].get_text().strip()

        # Append the data to the lists
        names.append(name)
        districts.append(district)
        parties.append(party)
        phones.append(phone)

# Create a Pandas dataframe from the lists
df = pd.DataFrame({"Senator Name": names, "District": districts, "Party": parties, "Phone Number": phones})

# Print the dataframe
print(df)

Upvotes: 0

Views: 36

Answers (1)

aardvarkk

Reputation: 15996

What I find helpful when trying to scrape a webpage is to open it in a browser and look through the source manually. The browser's developer tools can help you hone your selectors. I haven't worked with Beautiful Soup in a long time, but generally a scraper will let you submit either CSS selectors or XPath queries to drill down to the data you want. You can usually experiment with both in the browser's developer tools.
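In Beautiful Soup itself, CSS selectors go through select() and select_one(); as far as I know it does not evaluate XPath on its own (lxml is the usual choice for that). Here is a minimal sketch of the difference between a bare tag selector and a narrowed one. The HTML snippet and the class names in it are made up for illustration and are not taken from the Senate page.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; the real page's markup will differ.
html = """
<div class="sidebar"><h3>Quick Links</h3></div>
<div class="member"><h3>Jane Doe (D)</h3><p>District 01</p></div>
<div class="member"><h3>John Roe (R)</h3><p>District 02</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# A bare tag selector matches every h3, including unrelated ones.
all_h3 = [h.get_text(strip=True) for h in soup.select("h3")]

# A descendant selector keeps only h3 tags nested inside div.member.
member_h3 = [h.get_text(strip=True) for h in soup.select("div.member h3")]

print(all_h3)
print(member_h3)
```

Whatever selector you settle on in the browser can be pasted into select() unchanged, which is why it pays to experiment there first.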

For example, I opened up your page in the Firefox inspector:

[Screenshot: Firefox inspector]

I've highlighted two important areas with arrows. In the "Search HTML" section, you can actually type in CSS selectors or XPath searches.

If I search for h3, I get 202 results (as you said -- more than the number of senators). You can keep finding next/previous and see what results are coming up.

[Screenshot: searching for h3]

Now use the element-picker button to highlight a particular senator, and look at the breadcrumbs section at the bottom.

[Screenshot: selected particular element]

I always find it helpful to then look at the breadcrumbs, i.e. the ancestors of the element I want. In this case, I noticed there's a parent div element with the class views-field-field-senator-last-name. I tried entering that into my CSS selector search (something along the lines of .views-field-field-senator-last-name h3), so now I'm looking for h3 tags that are descendants of elements with the class views-field-field-senator-last-name.

[Screenshot: finding senators]

And now it is showing me only the relevant senators.
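Carried over to Beautiful Soup, the same narrowed query might look roughly like the sketch below. Only the class name views-field-field-senator-last-name comes from the inspector session; the surrounding markup, the views-row wrapper, the p tag holding the district, and the sample names are all assumptions made up so the example runs without hitting the site.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: only the class views-field-field-senator-last-name
# was observed on the real page; everything else here is invented.
html = """
<h3>Senators</h3>
<div class="views-row">
  <div class="views-field-field-senator-last-name"><h3>Jane Doe (D)</h3></div>
  <p>District 01</p>
</div>
<div class="views-row">
  <div class="views-field-field-senator-last-name"><h3>John Roe (R)</h3></div>
  <p>District 02</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Same query as in the inspector: h3 descendants of the classed div.
headings = soup.select(".views-field-field-senator-last-name h3")

rows = []
for h3 in headings:
    # Walk up to the enclosing row to reach the other fields.
    # The wrapper class "views-row" is an assumption.
    row = h3.find_parent("div", class_="views-row")
    p = row.find("p") if row else None
    rows.append({
        "Senator Name": h3.get_text(strip=True),
        "District": p.get_text(strip=True) if p else None,
    })

df = pd.DataFrame(rows)
print(df)
```

For the live page you would replace the html string with the text of your requests.get(url) response, then check in the inspector which ancestor element actually wraps each senator's phone and district before relying on find_parent.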

Hope this helps.

Upvotes: 0
