Reputation: 1217
Update as of March 19th, 2023:
Andrej - you told me that I can do it like so:
You can iterate over tags with class="hubCardTitle" and the next element afterward using zip():
import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view",
]

out = []
for url in urls:
    print(f"Getting {url}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    d = {"URL": url, "Title": soup.h2.text}

    titles = soup.select("div.hubCardTitle")
    content = soup.select("div.hubCardTitle + div")

    for t, c in zip(titles, content):
        t = t.get_text(strip=True)
        c = c.get_text(strip=True, separator="\n")
        d[t] = c

    out.append(d)

df = pd.DataFrame(out)
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice).
However, this code is getting the following error on Google Colab:
Getting https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view
---------------------------------------------------------------------------
gaierror Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/urllib3/connection.py in _new_conn(self)
173 try:
--> 174 conn = connection.create_connection(
175 (self._dns_host, self.port), self.timeout, **extra_kw
15 frames
gaierror: [Errno -5] No address associated with hostname
During handling of the above exception, another exception occurred:
NewConnectionError Traceback (most recent call last)
NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f71f6a79b80>: Failed to establish a new connection: [Errno -5] No address associated with hostname
During handling of the above exception, another exception occurred:
MaxRetryError Traceback (most recent call last)
MaxRetryError: HTTPSConnectionPool(host='s3platform-legacy.jrc.ec.europa.eu', port=443): Max retries exceeded with url: /digital-innovation-hubs-tool/-/dih/3480/view (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f71f6a79b80>: Failed to establish a new connection: [Errno -5] No address associated with hostname'))
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
517 raise SSLError(e, request=request)
518
--> 519 raise ConnectionError(e, request=request)
520
521 except ClosedPoolError as e:
ConnectionError: HTTPSConnectionPool(host='s3platform-legacy.jrc.ec.europa.eu', port=443): Max retries exceeded with url: /digital-innovation-hubs-tool/-/dih/3480/view (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f71f6a79b80>: Failed to establish a new connection: [Errno -5] No address associated with hostname'))
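For context, gaierror [Errno -5] means the hostname could not be resolved from the Colab runtime (a DNS failure), so the request never reached the server. If the failure is only intermittent, one possible mitigation is to send the requests through a Session configured with retries and a timeout; a minimal sketch (the retry count, back-off factor and timeout are assumptions, not values from the original code):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view"

# retry transient failures a few times with increasing back-off
session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

try:
    resp = session.get(url, timeout=10)
    resp.raise_for_status()
    print(resp.status_code)
except requests.exceptions.RequestException as exc:
    print(f"Request failed: {exc}")

If the hostname never resolves from Colab at all, retries will not help; the runtime simply cannot reach that host, and the same code may still work from a local machine.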
Full story:
This is similar to this thread and task: scrape Wikipedia text with BS4 (pair each heading with its associated paragraphs) and output it in CSV format.
I have a question: how do I iterate over a set of 700 URLs to get the data of 700 digital hubs into CSV (or Excel) format?
See the page where we have the datasets:
https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool
with a list of URLs like these:
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view
and so on and so forth
Question: can we apply this to the similar task too, i.e. to this collection of digital hub data? I have applied a scraper to a single page with this and it works, but how do I add CSV output to a scraper that iterates over the URLs? Can we write the output to CSV too, while applying the same technique?
I want to pair the scraped paragraphs with the most recently scraped heading from the hubCards. I am currently scraping the hubCards as single pages to work out the method; however, I would like to scrape all 700 cards together with their headings so I can see the data together in one file. I want to write the results to an appropriate format, which may be a CSV file.
See a result page: https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1520/view?_eu_europa_ec_jrc_dih_web_DihWebPortlet_backUrl=%2Fdigital-innovation-hubs-tool
Note that each hubCard has the following headings:
Title (probably an h4 tag)
Contact
Description
Organization
Evolutionary Stage
Geographical Scope
Funding
Partners
Technologies
What I have for a single page is this:
from bs4 import BeautifulSoup
import requests

page_link = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view'
page_response = requests.get(page_link, verify=False, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")

textContent = []
for tag in page_content.find_all('h4')[1:]:
    # keep the heading itself
    texth4 = tag.text.strip()
    textContent.append(texth4)
    # keep only the following paragraphs whose closest preceding h4 is this heading
    for item in tag.find_next_siblings('p'):
        if texth4 in item.find_previous_siblings('h4')[0].text.strip():
            textContent.append(item.text.strip())

print(textContent)
Output in the console:
Description', 'Link to national or regional initiatives for digitising industry', 'Market and Services', 'Service Examples', 'Leveraging the holding system "EndoTAIX" from scientific development to ready-to -market', 'For one of SurgiTAIX AG\'s products, the holding system "EndoTAIX" for surgical instrument fixation, the SurgiTAIX AG cooperated very closely with the RWTH University\'s Helmholtz institute. The services provided comprised the complete first phase of scientific development. Besides, after the first concepts of the holding system took shape, a prototype was successfully build in the scope of a feasibility study. In the role regarding the self-conception as a transfer service provider offering services itself, the SurgiTAIX AG refined the technology to market level and successfully performed all the steps necessary within the process to the approval and certification of the product. Afterwards, the product was delivered to another vendor with SurgiTAIX AG carrying out the production process as an OEM.', 'Development of a self-adapting robotic rehabilitation system', 'Based on the expertise of different partners of the hub, DIERS International GmbH (SME) was enabled to develop a self-adapting robotic rehabilitation system that allows patients after stroke to relearn motion patterns autonomously. The particular challenge of this cooperation was to adjust the robot to the individual and actual needs of the patient at any particular time of the exercise. Therefore, different sensors have been utilized to detect the actual movement performance of the patient. Feature extraction algorithms have been developed to identify the actual needs of the individual patient and intelligent predicting control algorithms enable the robot to independently adapt the movement task to the needs of the patient. These challenges could be solved only by the services provided by different partners of the hub which include the transfer of the newly developed technologies, access to patient data, acquisition of knowledge and demands from healthcare personal and coordinating the application for public funding.', 'Establishment of a robotic couch lab and test facility for radiotherapy', 'With the help of services provided by different partners of the hub, the robotic integrator SME BEC GmbH was given the opportunity to enhance their robotic patient positioning device "ExaMove" to allow for compensation of lung tumor movements during free breathing. The provided services solved the need to establish a test facility within the intended environment (the radiotherapy department) and provided the transfer of necessary innovative technologies such as new sensors and intelligent automatic control algorithms. Furthermore, the provided services included the coordination of the consortium, identifying, preparing and coordinating the application for public funding, provision of access to the hospital’s infrastructure and the acquisition of knowledge and demands from healthcare personal.', 'Organization', 'Evolutionary Stage', 'Geographical Scope', 'Funding', 'Partners', 'Technologies']
So far so good. What I am aiming for now is a clean solution: how do I iterate over the set of 700 URLs (in other words, the 700 hubCards) to get the data of all 700 digital hubs into CSV (or Excel) format?
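For reference, here is a minimal sketch of one way to wrap the single-page approach above in a loop over all the URLs and write one CSV row per hub, with the scraped h4 headings as columns. It assumes every hubCard page has the same h4-based structure as the single page above; the output file name hubcards.csv and the timeout are my own choices, not part of the original code:

import csv
import requests
from bs4 import BeautifulSoup

urls = [
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view",
    # ... extend with the remaining hubCard URLs
]

rows = []
for url in urls:
    soup = BeautifulSoup(requests.get(url, timeout=10).content, "html.parser")
    row = {"URL": url}
    # pair each h4 heading with the text of the paragraphs that follow it
    for tag in soup.find_all("h4")[1:]:
        heading = tag.text.strip()
        paragraphs = [
            p.text.strip()
            for p in tag.find_next_siblings("p")
            if p.find_previous_siblings("h4")
            and heading in p.find_previous_siblings("h4")[0].text.strip()
        ]
        row[heading] = "\n".join(paragraphs)
    rows.append(row)

# use the union of all headings as the CSV columns so no page's data is dropped
fieldnames = ["URL"] + sorted({k for r in rows for k in r if k != "URL"})
with open("hubcards.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)

The resulting file can then be opened in Excel or LibreOffice, or loaded back with pandas via pd.read_csv("hubcards.csv").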
Update:
Here is some example code that uses Python and BeautifulSoup to scrape the pages and extract the information for each digital hub:
import requests
from bs4 import BeautifulSoup
import csv

# create a list of the URLs for each digital hub
urls = [
    'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool/details/AL00106',
    'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool/details/AT00020',
    # add the rest of the URLs here
]

# create an empty list to store the data for each digital hub
data = []

# iterate over each URL and extract the relevant information
for url in urls:
    # make a GET request to the webpage
    response = requests.get(url)
    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    # extract the relevant information from the HTML
    name = soup.find('h3', class_='mb-0').text.strip()
    country = soup.find('div', class_='col-12 col-md-6 col-lg-4 mb-3 mb-md-0').text.strip()
    website = soup.find('a', href=lambda href: href and 'http' in href).get('href')
    description = soup.find('div', class_='col-12 col-md-8').text.strip()
    # add the extracted information to the data list as a dictionary
    data.append({'Name': name, 'Country': country, 'Website': website, 'Description': description})

# write the data to a CSV file
with open('digital_hubs.csv', 'w', newline='') as csvfile:
    fieldnames = ['Name', 'Country', 'Website', 'Description']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for hub in data:
        writer.writerow(hub)
In this example code, we first create a list of the URLs for each digital hub. We then iterate over each URL with a for loop and extract the relevant information using BeautifulSoup. We store the extracted information for each digital hub as a dictionary in the data list. Finally, we write the data to a CSV file using the csv module.
Upvotes: 1
Views: 106
Reputation: 195408
You can iterate over tags with class="hubCardTitle" and the next element afterward using zip():
import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view",
]

out = []
for url in urls:
    print(f"Getting {url}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    d = {"URL": url, "Title": soup.h2.text}

    titles = soup.select("div.hubCardTitle")
    content = soup.select("div.hubCardTitle + div")

    for t, c in zip(titles, content):
        t = t.get_text(strip=True)
        c = c.get_text(strip=True, separator="\n")
        d[t] = c

    out.append(d)

df = pd.DataFrame(out)
df.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice).
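To scale this to the full set of ~700 hubCards, the same code applies; only the urls list needs to be extended. A minimal sketch of that extension with basic error handling and a short pause between requests (the timeout and the one-second sleep are assumptions, not part of the answer above):

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view",
    # ... extend with the remaining ~700 hubCard URLs
]

out = []
for url in urls:
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"Skipping {url}: {exc}")
        continue

    soup = BeautifulSoup(resp.content, "html.parser")
    d = {"URL": url, "Title": soup.h2.text if soup.h2 else ""}

    # same pairing of each hubCardTitle div with the element that follows it
    for t, c in zip(soup.select("div.hubCardTitle"),
                    soup.select("div.hubCardTitle + div")):
        d[t.get_text(strip=True)] = c.get_text(strip=True, separator="\n")

    out.append(d)
    time.sleep(1)  # assumed pause between requests to be polite to the server

pd.DataFrame(out).to_csv("data.csv", index=False)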
Upvotes: 1