scrape_noob

Reputation: 53

Scraping all URLs from a search result page with BeautifulSoup

I'm trying to get 100 URLs from the following search result page:

https://www.willhaben.at/iad/kaufen-und-verkaufen/marktplatz/fahrraeder-radsport/fahrraeder-4552?rows=100&areaId=900

Here's the test code I have:

import requests
from bs4 import BeautifulSoup

urls = []

def get_urls(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    s = soup.find('a', class_="header w-brk")
    urls.append(s)
    print(urls)

Unfortunately the list just contains [None]. I've also tried passing href=True to soup.find and soup.find_all, but that doesn't work either. I can also see another problem coming:

The URLs the page provides in its source are relative, for example a href="/iad/kaufen-und-verkaufen/d/fahrrad-429985104/", so they are just the tail end of the full willhaben.at address. Even once I get all of these hrefs appended to my list, I won't be able to scrape them as they are; I'll have to prepend the root URL before my scraper can load them.

What is the most effective way I can solve this?

Thanks!

Upvotes: 3

Views: 187

Answers (4)

Manbir Judge

Reputation: 113

This is the code you are looking for; I hope it does not need much explanation:

import requests
from bs4 import BeautifulSoup

urls = []

def get_urls(page_url):
    global urls

    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, "html.parser")

    anchor_tags = soup.find_all("a", href=True)
    urls = [anchor_tag.get("href") for anchor_tag in anchor_tags]
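
Note that this grabs every href on the page, including navigation and footer links, so in practice you would still filter for the listing paths and prepend the domain. A rough usage sketch, where the /iad/kaufen-und-verkaufen/d/ prefix is simply taken from the example href in the question and may need adjusting:

get_urls("https://www.willhaben.at/iad/kaufen-und-verkaufen/marktplatz/fahrraeder-radsport/fahrraeder-4552?rows=100&areaId=900")

# keep only hrefs that look like individual listings and make them absolute
listing_urls = ["https://www.willhaben.at" + href
                for href in urls
                if href.startswith("/iad/kaufen-und-verkaufen/d/")]
print(listing_urls)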

Upvotes: 1

manzt

Reputation: 454

Check this out:

import requests
from bs4 import BeautifulSoup

urls = []

url = "https://www.willhaben.at/iad/kaufen-und-verkaufen/marktplatz/fahrraeder-radsport/fahrraeder-4552?rows=100&areaId=900"

def get_urls(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    s = soup.find_all("div", {"class": "w-brk"})
    for link in s:
        l = link.find("a")
        urls.append("https://www.willhaben.at" + l['href'])
    print(urls)

get_urls(url)
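
This works where the code in the question didn't because the class sits on a wrapping div, not on the anchor itself, which is why soup.find('a', class_="header w-brk") comes back as None: you have to find the div first and then pick the a inside it.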

Upvotes: 2

Ismail Durmaz

Reputation: 2631

There are several ways you can get the anchor URLs.

soup.select elegant way:

urls.extend([a.attrs['href'] for a in soup.select('div.header.w-brk a')])

soup.select simpler way:

for a in soup.select('div.header.w-brk a'):
    urls.append(a.attrs['href'])

soup.find_all simpler way:

for div in soup.find_all('div', class_="header w-brk"):
    urls.append(div.find('a').attrs['href'])

soup.find_all elegant way:

urls.extend([div.find('a').attrs['href'] for div in soup.find_all('div', class_="header w-brk")])
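
If you want this in the same shape as your get_urls function, here is a minimal sketch (assuming the div.header.w-brk markup targeted above) that returns a fresh list instead of appending to a module-level one:

import requests
from bs4 import BeautifulSoup

def get_urls(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # same selector as above: anchors nested inside the div.header.w-brk containers
    return [a.attrs['href'] for a in soup.select('div.header.w-brk a')]

urls = get_urls("https://www.willhaben.at/iad/kaufen-und-verkaufen/marktplatz/fahrraeder-radsport/fahrraeder-4552?rows=100&areaId=900")
print(len(urls))  # with rows=100 in the query string this should be roughly 100 links

Returning the list also keeps the function reusable if you later want to loop over several result pages.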

Upvotes: 4

Oryon

Reputation: 127

For the second part of your question, you could use a simple list comprehension:

urls_with_base = [f"{base_url}/{url}" for url in urls]
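
One caveat: the hrefs scraped from this page already start with a slash (e.g. /iad/kaufen-und-verkaufen/d/...), so the extra "/" in the f-string would give you a double slash. urllib.parse.urljoin handles that for you; a small sketch, with base_url assumed to be the site root:

from urllib.parse import urljoin

base_url = "https://www.willhaben.at"
urls_with_base = [urljoin(base_url, url) for url in urls]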

Upvotes: 2
