Reputation: 15
I'm trying to get only the links that contain the text /Archive.aspx?ADID=. However, I always get all the links on the webpage instead. After I get the links I want, how would I navigate to each of those pages?
from bs4 import BeautifulSoup, SoupStrainer
import requests
url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
key = '/Archive.aspx?ADID='
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
for link in soup.find_all('a'):
    if 'Archive.aspx?ADID=' in page.text:
        print(link.get('href'))
Upvotes: 1
Views: 373
Reputation: 719
It's a logic mistake.
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
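for link in soup.find_all('a'):
    # Test this link's own href, not the whole page.
    if 'Archive.aspx?ADID=' in link.get('href', ''):
        print(link.get('href'))
This is the corrected loop.
The problem line was if 'Archive.aspx?ADID=' in page.text. For every link you grabbed, it checked whether the ENTIRE page (page.text) contains that text. That is always true here, so every link was printed. Test each link's own href instead. link.get('href', '') returns an empty string for anchors that have no href, so the in check never sees None.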
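To see why the original condition lets everything through, here is a small sketch that needs no network access. The page_text and hrefs values are made-up stand-ins for page.text and the hrefs on the real page, not data from the site.

```python
# Hypothetical stand-ins for page.text and the hrefs found on the page.
page_text = '<a href="Archive.aspx?ADID=1">Agenda</a> <a href="Home.aspx">Home</a>'
hrefs = ["Archive.aspx?ADID=1", "Home.aspx"]
key = "Archive.aspx?ADID="

# Original test: the condition never looks at the loop variable,
# so it has the same value (True) for every link.
buggy = [href for href in hrefs if key in page_text]

# Fixed test: check each link's own href.
fixed = [href for href in hrefs if key in href]

print(buggy)
print(fixed)
```

The first list keeps both hrefs, the second keeps only the archive one.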
Upvotes: 0
Reputation: 195563
Try:
import requests
from bs4 import BeautifulSoup
url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
key = "Archive.aspx?ADID="
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for link in soup.find_all("a"):
    if key in link.get("href", ""):
        print("https://www.ci.atherton.ca.us/" + link.get("href"))
Prints:
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3581
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3570
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3564
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3559
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3556
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3554
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3552
...and so on.
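For the second part of the question (navigating to each page), one option is to resolve every matching href against the page URL with urllib.parse.urljoin and then request the results one by one. This is only a sketch and has not been run against the live site. The helper names absolute_links and fetch_pages are made up for illustration, and the 10 second timeout is an arbitrary choice.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
KEY = "Archive.aspx?ADID="


def absolute_links(hrefs, base_url=BASE_URL, key=KEY):
    """Keep only hrefs containing key and resolve them against base_url."""
    # `href and ...` skips anchors whose href is missing (None) or empty.
    return [urljoin(base_url, href) for href in hrefs if href and key in href]


def fetch_pages(urls, timeout=10):
    """Request each URL in turn and yield (url, html) pairs."""
    import requests  # imported lazily so absolute_links works without it

    # One Session reuses the connection across requests to the same host.
    with requests.Session() as session:
        for url in urls:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()  # raise on 4xx/5xx responses
            yield url, response.text
```

With the soup from above, the hrefs would come from [a.get("href") for a in soup.find_all("a")]. Pass the result of absolute_links to fetch_pages, then parse each returned html with BeautifulSoup in the same way. urljoin handles hrefs with or without a leading slash, which plain string concatenation does not.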
Upvotes: 0