Reputation: 9
I am new to Beautiful Soup and to HTML. After following a tutorial, I am trying to scrape this webpage listing California senators: https://www.senate.ca.gov/senators

My goal is to extract each senator's name, party affiliation, district, and capitol office phone number, and ultimately put them into a pandas DataFrame. Looking at the page source, I see that h3 is the tag that matters for name/party, and that address/phone is tagged with p. If I find all h3 elements, I get 201, which is more than the number of senators, and I don't know how to drill down to just what I want. I can make the request and parse the response into a soup, but I am not sure how to extract the information I need. I have followed a few online tutorials, but they don't cover this case. Any help would be appreciated.
Latest try:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Send a GET request to the website
url = "https://www.senate.ca.gov/senators"
response = requests.get(url)
# Use Beautiful Soup to parse the HTML
soup = BeautifulSoup(response.content, "html.parser")
# Find the table that contains the senator information
table = soup.find("table", {"class": "views-table cols-4"})
# Create lists to store the data
names = []
districts = []
parties = []
phones = []
# Extract the senator information from each row in the table
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 4:
        name = cells[0].get_text().strip()
        district = cells[1].get_text().strip()
        party = cells[2].get_text().strip()
        phone = cells[3].get_text().strip()
        # Append the data to the lists
        names.append(name)
        districts.append(district)
        parties.append(party)
        phones.append(phone)
# Create a Pandas dataframe from the lists
df = pd.DataFrame({"Senator Name": names, "District": districts, "Party": parties, "Phone Number": phones})
# Print the dataframe
print(df)
Upvotes: 0
Views: 36
Reputation: 15996
What I find helpful when trying to scrape a webpage is to open the page in a browser and look through the source manually. The browser's developer tools can help you hone your selectors. I haven't worked with Beautiful Soup in a long time, but generally a scraper will let you submit either CSS selectors or XPath queries to drill down to the data you want. Both can usually be tried out in the browser's developer tools first.
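In Beautiful Soup specifically, CSS selectors go through `select()` and `select_one()`. As far as I know the library has no XPath support of its own (lxml is the usual choice for that). Here is a minimal sketch on a made-up HTML fragment; the class name `card` and the markup are hypothetical and not taken from the Senate page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not the real Senate page.
SAMPLE_HTML = """
<h3>Site navigation</h3>
<div class="card"><h3>First item</h3></div>
<div class="card"><h3>Second item</h3></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A bare tag selector matches every <h3> in the document.
all_headings = [h.get_text(strip=True) for h in soup.select("h3")]

# A descendant selector keeps only <h3> tags nested inside .card elements.
card_headings = [h.get_text(strip=True) for h in soup.select(".card h3")]

print(all_headings)
print(card_headings)
```

The second query is the pattern to aim for: find a class on an ancestor that only the wanted elements share, then select the tag underneath it.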
For example, I opened up your page in the Firefox inspector:
I've highlighted two important areas with arrows. In the "Search HTML" section, you can actually type in CSS selectors or XPath searches.
If I search for h3, I get 202 results (as you said, more than the number of senators). You can keep stepping through next/previous to see which results come up.
Now use the element-picker button to highlight a particular senator, and look at the breadcrumbs section at the bottom.
I always find it helpful to then look at the breadcrumbs, or the ancestors of the element I want. In this case, I noticed there's a parent div element with the class views-field-field-senator-last-name. I try entering that into my CSS selector search, so now I'm looking for h3 tags that are descendants of elements with the class views-field-field-senator-last-name (that is, the selector .views-field-field-senator-last-name h3).
And now it is showing me only the relevant senators.
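Carried over to Beautiful Soup, that same selector might look like the sketch below. The HTML is a small hand-written stand-in so the example runs offline. Only the class name `views-field-field-senator-last-name` comes from the inspector; the sample names, the helper `parse_senators`, and the "Name (P)" heading format are assumptions to check against the real page.

```python
import re
from bs4 import BeautifulSoup

# Hand-written stand-in for the page; names and layout are invented.
SAMPLE_HTML = """
<h3>Unrelated heading</h3>
<div class="views-field-field-senator-last-name"><h3>Jane Doe (D)</h3></div>
<div class="views-field-field-senator-last-name"><h3>John Roe (R)</h3></div>
"""

SELECTOR = ".views-field-field-senator-last-name h3"

# Assumed heading format: "Full Name (P)", where P is a party letter.
HEADING_RE = re.compile(r"^(?P<name>.+?)\s*\((?P<party>[A-Za-z]+)\)$")


def parse_senators(html):
    """Return one dict per <h3> matched by SELECTOR."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for heading in soup.select(SELECTOR):
        text = heading.get_text(strip=True)
        match = HEADING_RE.match(text)
        if match:
            rows.append({"name": match["name"], "party": match["party"]})
        else:
            # Keep headings that do not fit the assumed format.
            rows.append({"name": text, "party": None})
    return rows


rows = parse_senators(SAMPLE_HTML)
print(rows)
```

For the live page you would pass `requests.get(url).content` in place of `SAMPLE_HTML`, and `pd.DataFrame(rows)` turns the list of dicts into a DataFrame. District and phone would need their own selectors, found the same way in the inspector.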
Hope this helps.
Upvotes: 0