Reputation: 53
I'm trying to get 100 URLs from the following search result page:
https://www.willhaben.at/iad/kaufen-und-verkaufen/marktplatz/fahrraeder-radsport/fahrraeder-4552?rows=100&areaId=900
Here's the test code I have:
import requests
from bs4 import BeautifulSoup
urls = []
def get_urls(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    s = soup.find('a', class_="header w-brk")
    urls.append(s)
    print(urls)
Unfortunately the list returns [None]. I've also tried passing href=True to the soup.find or soup.find_all method, but that doesn't work either. I can also see another problem with this:
The href the page provides in the source is relative, for example <a href="/iad/kaufen-und-verkaufen/d/fahrrad-429985104/">, i.e. just the end of the willhaben.at URL. So even when I do get all of these URLs appended to my list, I won't be able to scrape them as they are; I'll have to prepend the root URL before my scraper can load them.
What is the most effective way I can solve this?
Thanks!
Upvotes: 3
Views: 187
Reputation: 113
This is the code you are looking for. I hope it needs no further explanation:
import requests
from bs4 import BeautifulSoup
urls = []
def get_urls(page_url):
    global urls
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, "html.parser")
    anchor_tags = soup.find_all("a", href=True)
    urls = [anchor_tag.get("href") for anchor_tag in anchor_tags]
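Note that this collects every href on the page, not only the listing links. An untested usage sketch, calling it with the search URL from the question:
get_urls("https://www.willhaben.at/iad/kaufen-und-verkaufen/marktplatz/fahrraeder-radsport/fahrraeder-4552?rows=100&areaId=900")
print(len(urls), urls[:5])  # prints how many hrefs were found and the first five of them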
Upvotes: 1
Reputation: 454
Check out:
import requests
from bs4 import BeautifulSoup
urls = []
url = "https://www.willhaben.at/iad/kaufen-und-verkaufen/marktplatz/fahrraeder-radsport/fahrraeder-4552?rows=100&areaId=900"
def get_urls(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    divs = soup.find_all("div", {"class": "w-brk"})
    for div in divs:
        link = div.find("a")
        urls.append("https://www.willhaben.at" + link['href'])
    print(urls)

get_urls(url)
Upvotes: 2
Reputation: 2631
You can choose from several ways to get the anchor URLs (each option assumes you already have a soup of the page); a complete sketch putting one of them together with the base URL follows after the options.
soup.select elegant way:
urls.extend([a.attrs['href'] for a in soup.select('div.header.w-brk a')])
soup.select simpler way:
for a in soup.select('div.header.w-brk a'):
    urls.append(a.attrs['href'])
soup.find_all simpler way:
for div in soup.find_all('div', class_="header w-brk"):
    urls.append(div.find('a').attrs['href'])
soup.find_all elegant way:
urls.extend([div.find('a').attrs['href'] for div in soup.find_all('div', class_="header w-brk")])
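Putting it together, a minimal untested sketch (assuming the listing anchors sit inside div elements with the classes header and w-brk, as in the question) could look like this:
import requests
from bs4 import BeautifulSoup

# Fetch the search page and collect the listing hrefs with soup.select.
search_url = "https://www.willhaben.at/iad/kaufen-und-verkaufen/marktplatz/fahrraeder-radsport/fahrraeder-4552?rows=100&areaId=900"
page = requests.get(search_url)
soup = BeautifulSoup(page.content, 'html.parser')

# The hrefs are relative (e.g. /iad/...), so prefix the site root to make them loadable.
urls = ["https://www.willhaben.at" + a.attrs['href'] for a in soup.select('div.header.w-brk a')]
print(urls)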
Upvotes: 4
Reputation: 127
For the second part of your question you could use a simple list comprehension. Since the scraped hrefs already start with a slash, the base URL can be prefixed directly:
urls_with_base = [f"{base_url}{url}" for url in urls]
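Alternatively, urllib.parse.urljoin from the standard library resolves relative hrefs against the base URL and avoids accidental double slashes. A small sketch, using the example href quoted in the question:
from urllib.parse import urljoin

base_url = "https://www.willhaben.at"
urls = ["/iad/kaufen-und-verkaufen/d/fahrrad-429985104/"]  # example href taken from the question
urls_with_base = [urljoin(base_url, url) for url in urls]
print(urls_with_base)  # ['https://www.willhaben.at/iad/kaufen-und-verkaufen/d/fahrrad-429985104/']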
Upvotes: 2