raptorzee

Reputation: 167

How to get href from a list of websites using web scraping?

I am learning Python and trying to improve my web scraping skills. I want to scrape a website to get the websites of all the exhibitors listed on it, but my Python code cannot find the HTML that contains these details.

<div class="m-exhibitor-entry__item__body__contacts__additional__website">
   <h4>Website</h4>
   <a href="http://www.aasiasteel.com/" target="_blank">http://www.aasiasteel.com/</a>
</div>

I am trying to use the code below, but it returns nothing.

for i in soup.findAll( "div",attrs={"class":"m-exhibitor-entry__item__body__contacts__additional__website"}):
   print(i['href'])

Can someone help me with this?

Upvotes: 1

Views: 785

Answers (1)

Andrej Kesely

Reputation: 195408

The company information is loaded via an Ajax request, so it is not in the HTML of the list page that BeautifulSoup parses. However, you can simulate this Ajax request with the requests library.

Example:

import requests
from bs4 import BeautifulSoup


url = "https://forum.iktva.sa/exhibitors-list"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

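# Each link on the list page carries the path to one company's detail page.
for a in soup.select(".m-exhibitors-list__items__item__header__title__link"):
    # The path is the single-quoted part of the href attribute.
    company_url = "https://forum.iktva.sa/" + a["href"].split("'")[1]

    # Fetch the detail page that the site would otherwise load via Ajax.
    soup2 = BeautifulSoup(requests.get(company_url).content, "html.parser")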
    print(
        "{:40} {}".format(
            soup2.select_one(".m-exhibitor-entry__item__header__title").text,
            soup2.select_one("h4+a")["href"],
        )
    )

Prints:

Aasia Steel Industrial Group             http://www.aasiasteel.com/
ADES                                     http://investors.adihgroup.com/
AEC                                      https://www.aecl.com
Al Rushaid Group                         https://www.al-rushaid.com/home.html
alfanar                                  https://www.alfanar.com/
AlGihaz Contracting Co.                  https://algihaz.com/
Alkhorayef                               https://www.alkhorayefpetroleum.com/
Alturki Holding                          https://alturkiholding.com/
ArcelorMittal                            https://corporate.arcelormittal.com/
ARO Drilling                             https://www.arodrilling.com/
Baker Hughes                             https://www.bakerhughes.com/
Bin Quraya                               http://www.binquraya.com/
DPS                                      https://www.egyptian-drilling.com/
Global Suhaimi                           http://globalsuhaimi.com/
Halliburton                              https://www.halliburton.com/
Honeywell                                https://www.honeywell.com/us/en
Jana Marine Services                     https://www.jana-ms.com/ar/
Larsen & Toubro Limited                  https://www.larsentoubro.com
McDermott                                https://www.mcdermott.com/
Mitsubishi Power                         https://power.mhi.com/regions/mena
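
A note on the `a["href"].split("'")[1]` step: it takes whatever sits between the first pair of single quotes in the link's href and appends it to the site root. The sketch below isolates that step so it can be tried without any network access. The helper name `extract_quoted_path` and the sample href value are hypothetical and invented for illustration; the real href on the page may look different. `urljoin` is used here in place of plain string concatenation, which is a swapped-in choice rather than what the code above does.

```python
from urllib.parse import urljoin

BASE_URL = "https://forum.iktva.sa/"


def extract_quoted_path(href):
    """Return the text between the first pair of single quotes, or None."""
    parts = href.split("'")
    # "a'b'c".split("'") gives ["a", "b", "c"]; fewer than 3 parts
    # means the string has no complete pair of single quotes.
    if len(parts) < 3:
        return None
    return parts[1]


# Hypothetical href value, made up for this example only.
sample_href = "javascript:openRemoteModal('exhibitors-list/example-company')"

path = extract_quoted_path(sample_href)
company_url = urljoin(BASE_URL, path)
print(company_url)
```

Returning None instead of raising IndexError makes it easy to skip links that do not follow the expected pattern.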

EDIT: To save to CSV:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://forum.iktva.sa/exhibitors-list"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for a in soup.select(".m-exhibitors-list__items__item__header__title__link"):
    company_url = "https://forum.iktva.sa/" + a["href"].split("'")[1]

    soup2 = BeautifulSoup(requests.get(company_url).content, "html.parser")
    all_data.append(
        [
            soup2.select_one(".m-exhibitor-entry__item__header__title").text,
            soup2.select_one("h4+a")["href"],
        ]
    )
    print(*all_data[-1])


df = pd.DataFrame(all_data, columns=["Name", "URL"])
df.to_csv("data.csv", index=False)  # <-- save to CSV
print(df)
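
As a side note on the loop in the question: even when the HTML is available, `i['href']` reads the attribute from the `<div>`, which has no href, so it would raise a KeyError. The href sits on the nested `<a>` tag. A minimal sketch, using the HTML snippet from the question as a static string (so it says nothing about how the live site behaves):

```python
from bs4 import BeautifulSoup

# HTML copied from the question, used here as a static string.
html = """
<div class="m-exhibitor-entry__item__body__contacts__additional__website">
   <h4>Website</h4>
   <a href="http://www.aasiasteel.com/" target="_blank">http://www.aasiasteel.com/</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for div in soup.find_all(
    "div", class_="m-exhibitor-entry__item__body__contacts__additional__website"
):
    # Look for the <a> inside the div; href=True skips anchors without an href.
    a = div.find("a", href=True)
    if a is not None:
        links.append(a["href"])

print(links)
```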

Upvotes: 1
