mates125

Reputation: 1

Python beautifulsoup and openpyxl

So, I'm trying to use BeautifulSoup for data extraction (a web crawler/scraper), and I'm iterating over each tag in the HTML to find the data that I want. My objective is to get a specific piece of information and write it into an Excel sheet with the openpyxl library. Here's an example:

<table id="Table">   
    <tr>
        <th>Info A1</th>
        <th>Info B1</th>
        <th>Info C1</th>
        <th>Info D1</th>
        <th>Info E1</th>
    </tr>
    <tr>
        <th>Info A2</th>
        <th>Info B2</th>
        <th>Info C2</th>
        <th>Info D2</th>
        <th>Info E2</th>
    </tr>
</table>

Basically, what I want to do is compare all the "A number" values in the table, and if one of them matches the information that I have, grab the rest of the values in the same tr and put them into an Excel file. The real table is way bigger than this example, and I've already had success iterating over it, but I don't know how to identify the information that I want and compare it with the information I already have.
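A minimal sketch of that flow with BeautifulSoup plus openpyxl might look like the following. The `wanted` set and the output filename are placeholders standing in for your known "A" values; the HTML is the sample table from the question:

```python
from bs4 import BeautifulSoup
from openpyxl import Workbook

html = """
<table id="Table">
    <tr>
        <th>Info A1</th><th>Info B1</th><th>Info C1</th>
        <th>Info D1</th><th>Info E1</th>
    </tr>
    <tr>
        <th>Info A2</th><th>Info B2</th><th>Info C2</th>
        <th>Info D2</th><th>Info E2</th>
    </tr>
</table>
"""

wanted = {"Info A2"}  # placeholder: the "A" values you already have

soup = BeautifulSoup(html, "html.parser")
wb = Workbook()
ws = wb.active

for tr in soup.find("table", id="Table").find_all("tr"):
    # Collect the text of every cell in this row (the sample uses <th>,
    # but a real table would typically use <td>)
    cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
    # If the first cell matches a known value, keep the whole row
    if cells and cells[0] in wanted:
        ws.append(cells)

wb.save("matches.xlsx")  # placeholder filename
```

Each matching `tr` becomes one row in the worksheet, so the B/C/D/E cells land in columns alongside the matched A value.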

Upvotes: -1

Views: 221

Answers (2)

serkan

Reputation: 1

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL for the YÖK university list
yok_url = "https://www.yok.gov.tr/universiteler-listesi"

# Headers to mimic a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

try:
    # Fetch the HTML content of the page
    response = requests.get(yok_url, headers=headers)
    response.raise_for_status()
    
    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the section containing the university list
    universities = []
    for row in soup.select("table tr"):  # Assuming data is in a table structure
        columns = row.find_all("td")
        if len(columns) >= 2:  # guard against rows without both name and website cells
            name = columns[0].get_text(strip=True)
            website = columns[1].find("a")["href"] if columns[1].find("a") else "No Website"
            universities.append({"University Name": name, "Website": website})
    
    # Save results to an Excel file
    df = pd.DataFrame(universities)
    file_path = "Turkish_Universities_List.xlsx"
    df.to_excel(file_path, index=False)
    print(f"Data successfully scraped and saved to '{file_path}'.")
    
except Exception as e:
    print(f"An error occurred: {e}")

Upvotes: -1

TheConfax

Reputation: 184

d = {}
for tr in soup.find_all('tr'):  # find_all is the current name for findAll
    # Split per cell rather than with tr.text.split(), which breaks on
    # multi-word cells like "Info A1"
    cells = [c.get_text(strip=True) for c in tr.find_all(['td', 'th'])]
    if not cells:
        continue
    key = cells[0]
    val = cells[1:]
    d[key] = val
for key in d:
    if key in my_list:
        print(key)     # prints the match from your list
        print(d[key])  # prints the values attached to the match

This creates an empty dictionary and iterates through the soup (where your table should reside), adding every A value as a key and the corresponding B/C/D/E values as that key's list.

Then, for every key (A value) in the dictionary, it checks whether the key appears in my_list (your list of A values); if a match is found, the print statements run (change them according to your needs), with key corresponding to the A value and d[key] corresponding to the B/C/D/E values for that A value.
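On the question's sample table, the dict-based lookup could be exercised like this (`my_list` is assumed to hold the asker's known A values):

```python
from bs4 import BeautifulSoup

html = """
<table id="Table">
    <tr><th>Info A1</th><th>Info B1</th><th>Info C1</th></tr>
    <tr><th>Info A2</th><th>Info B2</th><th>Info C2</th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

my_list = ["Info A2"]  # placeholder: the A values you already have

# Build the dictionary: first cell -> remaining cells
d = {}
for tr in soup.find_all("tr"):
    cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
    if cells:
        d[cells[0]] = cells[1:]

# Keep only the keys that appear in my_list
matches = {key: d[key] for key in d if key in my_list}
# matches == {"Info A2": ["Info B2", "Info C2"]}
```

From `matches` you can then write each key and its values out as one Excel row.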

Upvotes: 1
