lsignori

Reputation: 15

Parsing for Specific Text in HTML href

I'm trying to get only the links whose href contains the text /Archive.aspx?ADID=. However, I always get every link on the webpage instead. Once I have the links I want, how would I navigate to each of those pages?

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
key = '/Archive.aspx?ADID='

page = requests.get(url)    
data = page.text
soup = BeautifulSoup(data)

for link in soup.find_all('a'):
    if 'Archive.aspx?ADID=' in page.text: 
        print(link.get('href'))

Upvotes: 1

Views: 373

Answers (2)

dir

Reputation: 719

It's a logic mistake.

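page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")

for link in soup.find_all('a'):
    # Test this link's own href, not the whole page; '' covers <a> tags without an href
    if 'Archive.aspx?ADID=' in link.get('href', ''):
        print(link.get('href'))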

The code above is the corrected loop. Your problem line was if 'Archive.aspx?ADID=' in page.text. For every link you grabbed, that condition checked whether the ENTIRE page (page.text) contains that piece of text. Since it does, the check was always true and every link was printed.
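To see the difference without touching the network, here is a minimal sketch on a made-up HTML snippet. The markup in SAMPLE_HTML and the helper name matching_hrefs are hypothetical, chosen only for illustration; the real page will look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<a href="Archive.aspx?ADID=1">Agenda</a>
<a href="/contact">Contact</a>
<a>No href at all</a>
"""

KEY = "Archive.aspx?ADID="


def matching_hrefs(html, key=KEY):
    """Return the href of every <a> tag whose own href contains key."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for link in soup.find_all("a"):
        href = link.get("href", "")  # "" when the tag has no href
        if key in href:
            hrefs.append(href)
    return hrefs


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Page-level check: the condition ignores `link`, so it is True on every iteration.
page_level = [link.get("href") for link in soup.find_all("a") if KEY in SAMPLE_HTML]
print(page_level)  # ['Archive.aspx?ADID=1', '/contact', None]

# Per-link check: only the link whose href contains the key survives.
print(matching_hrefs(SAMPLE_HTML))  # ['Archive.aspx?ADID=1']
```

The default of "" in link.get("href", "") matters because some anchor tags have no href, and testing a substring against None would raise a TypeError.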

Upvotes: 0

Andrej Kesely

Reputation: 195563

Try:

import requests
from bs4 import BeautifulSoup

url = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
key = "Archive.aspx?ADID="

soup = BeautifulSoup(requests.get(url).content, "html.parser")

for link in soup.find_all("a"):
    if key in link.get("href", ""):
        print("https://www.ci.atherton.ca.us/" + link.get("href"))

Prints:

https://www.ci.atherton.ca.us/Archive.aspx?ADID=3581
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3570
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3564
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3559
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3556
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3554
https://www.ci.atherton.ca.us/Archive.aspx?ADID=3552

...and so on.
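For the second half of the question (navigating to each of those pages), one possible approach is to turn every matching href into an absolute URL and request it in turn. The sketch below is an assumption about how that could look, not something tested against the live site: the helper names archive_links, fetch_pages and crawl are made up for illustration, and urljoin is used in place of string concatenation so that hrefs with or without a leading slash both resolve correctly.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

URL = "https://www.ci.atherton.ca.us/Archive.aspx?AMID=41"
KEY = "Archive.aspx?ADID="


def archive_links(html, base_url, key=KEY):
    """Return absolute, de-duplicated URLs of <a> tags whose href contains key."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for link in soup.find_all("a", href=True):  # skip <a> tags without an href
        href = link["href"]
        if key in href:
            absolute = urljoin(base_url, href)  # resolve relative hrefs
            if absolute not in urls:
                urls.append(absolute)
    return urls


def fetch_pages(urls, session=None):
    """Yield (url, response) for each URL, reusing one HTTP session."""
    session = session or requests.Session()
    for page_url in urls:
        response = session.get(page_url, timeout=30)
        response.raise_for_status()  # stop on HTTP errors
        yield page_url, response


def crawl(url=URL):
    """Fetch the listing page, then visit every matching archive link."""
    listing = requests.get(url, timeout=30)
    listing.raise_for_status()
    for page_url, response in fetch_pages(archive_links(listing.content, url)):
        # What each page contains is not established here, so only
        # report the status code and content type.
        print(page_url, response.status_code, response.headers.get("Content-Type"))
```

Calling crawl() would run the whole thing. From there, response.content (or response.text for HTML) holds each page's body, which can be parsed with BeautifulSoup again or written to disk, depending on what the archive pages turn out to be.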

Upvotes: 0
