Martin Bodzwag

Reputation: 13

Scraping a website by using Beautiful Soup in Python

I am trying to scrape https://www.eu-startups.com/directory/wpbdp_category/austrian-startups to get the listed information for Austrian startups. For each startup I would like to scrape where it is based, its tags, and its founding date. I am using Beautiful Soup, but I have no idea how to access this information. Right now I am able to select the .listing-title class, which gives me the name of the startup. The problem is that I don't know how to navigate within the .listing-details class, where the rest of the information is listed.

The code I am currently using is:

import bs4
import requests

result = requests.get('https://www.eu-startups.com/directory/wpbdp_category/austrian-startups')
content = bs4.BeautifulSoup(result.text, 'lxml')
content.select('.listing-details')[0]

The output is:

<div class="listing-details">
<div class="wpbdp-field-display wpbdp-field wpbdp-field-value field-display field-value wpbdp-field-business_name wpbdp-field-title wpbdp-field-type-textfield wpbdp-field-association-title"><span class="field-label">Business Name</span> <div class="value"><a href="https://www.eu-startups.com/directory/shopstory/" target="_self">Shopstory</a></div></div> <div class="wpbdp-field-display wpbdp-field wpbdp-field-value field-display field-value wpbdp-field-category wpbdp-field-category wpbdp-field-type-select wpbdp-field-association-category"><span class="field-label">Category</span> <div class="value"><a href="https://www.eu-startups.com/directory/wpbdp_category/austrian-startups/" rel="tag">Austria</a></div></div> <div class="wpbdp-field-display wpbdp-field wpbdp-field-value field-display field-value wpbdp-field-based_in wpbdp-field-meta wpbdp-field-type-textfield wpbdp-field-association-meta"><span class="field-label">Based in</span> <div class="value">Vienna</div></div> <div class="wpbdp-field-display wpbdp-field wpbdp-field-value field-display field-value wpbdp-field-tags wpbdp-field-meta wpbdp-field-type-textfield wpbdp-field-association-meta"><span class="field-label">Tags</span> <div class="value">Artificial Intelligence, E-Commerce, Marketing Automation, SaaS</div></div> <div class="wpbdp-field-display wpbdp-field wpbdp-field-value field-display field-value wpbdp-field-founded wpbdp-field-meta wpbdp-field-type-select wpbdp-field-association-meta"><span class="field-label">Founded</span> <div class="value">2020</div></div>
</div>

How can I access the other fields (Based in, Tags and Founded)?


Upvotes: 1

Views: 149

Answers (1)

Andrej Kesely

Reputation: 195508

Try:

import requests
from bs4 import BeautifulSoup


url = "https://www.eu-startups.com/directory/wpbdp_category/austrian-startups/page/1/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

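for listing in soup.select(".wpbdp-listing"):
    # The first link in each listing holds the startup's name.
    title = listing.a.text
    # Each value <div> directly follows the <span> carrying its label.
    based = listing.select_one("span:-soup-contains(Based) + div").text
    tags = listing.select_one("span:-soup-contains(Tags) + div").text.split(", ")
    founded = listing.select_one("span:-soup-contains(Founded) + div").text

    print(title, based, founded)
    print(tags)
    print()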

Prints:

Shopstory Vienna 2020
['Artificial Intelligence', 'E-Commerce', 'Marketing Automation', 'SaaS']

Tubics Vienna 2017
['Advertising', 'SaaS', 'Software', 'Video', 'VideoEditing']

25superstars Vienna 2020
['content creator', 'social media']

myCulture GmbH Vienna 2022
['CultTech', 'marketplace', 'big data']

And-Less Wien 2022
['Packaging', 'Plastic waste', 'Circular economy', 'Sustainable']

heyqq – ask away Vienna 2022
['audio', 'social', 'app']

NXRT Wien 2022
['Artificial Intelligence', 'Automotive', 'Autonomous Vehicles', 'Education', 'Enterprise Software', 'Information Technology', 'Railroad', 'Software', 'Software Engineering']

ReDev Vienna 2022
['Information Technology', 'Recruiting', 'SaaS', 'Software']

Revitalyze Innsbruck 2022
['Building Material', 'Green Building', 'Logistics', 'Marketplace', 'Recycling', 'Waste Management']

Coachfident Vienna 2022
['coaching', 'personal development', 'career coaching']

Goddard – Discovery Hagenberg 2022
['Artificial Intelligence', 'Machine Learning', 'Application Development']
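
An alternative sketch, not part of the answer's tested code: the markup quoted in the question gives each field its own class (wpbdp-field-based_in, wpbdp-field-tags, wpbdp-field-founded), so the values can probably also be selected by class instead of by label text. The helper names field_text and parse_listing are hypothetical, the sample HTML is a trimmed copy of the question's output, and returning None for a missing field is an assumption about how incomplete listings should be handled.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question (one listing only).
SAMPLE_HTML = """
<div class="listing-details">
  <div class="wpbdp-field-business_name"><span class="field-label">Business Name</span> <div class="value"><a href="https://www.eu-startups.com/directory/shopstory/">Shopstory</a></div></div>
  <div class="wpbdp-field-based_in"><span class="field-label">Based in</span> <div class="value">Vienna</div></div>
  <div class="wpbdp-field-tags"><span class="field-label">Tags</span> <div class="value">Artificial Intelligence, E-Commerce, Marketing Automation, SaaS</div></div>
  <div class="wpbdp-field-founded"><span class="field-label">Founded</span> <div class="value">2020</div></div>
</div>
"""


def field_text(details, field_class):
    """Return the stripped text of one field's value, or None if it is absent."""
    node = details.select_one(f".{field_class} .value")
    return node.get_text(strip=True) if node else None


def parse_listing(details):
    """Collect name, location, tags and founding year from one .listing-details block."""
    tags = field_text(details, "wpbdp-field-tags")
    return {
        "name": field_text(details, "wpbdp-field-business_name"),
        "based_in": field_text(details, "wpbdp-field-based_in"),
        # Split the comma-separated tag string; an absent field gives an empty list.
        "tags": [t.strip() for t in tags.split(",")] if tags else [],
        "founded": field_text(details, "wpbdp-field-founded"),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listings = [parse_listing(d) for d in soup.select(".listing-details")]
print(listings)
```

Against the live page, the same parse_listing could be applied to each element of soup.select(".listing-details"), assuming every listing uses the same field classes as the one quoted in the question.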

Upvotes: 1
