Reputation: 167
I am learning Python and trying to improve my web scraping skills. I am trying to scrape a website to collect the websites of all the exhibitors listed on it, but my Python code cannot find the HTML that holds those details:
<div class="m-exhibitor-entry__item__body__contacts__additional__website">
<h4>Website</h4>
<a href="http://www.aasiasteel.com/" target="_blank">http://www.aasiasteel.com/</a>
</div>
I am trying to use the code below, but I am not getting anything.
for i in soup.findAll("div", attrs={"class": "m-exhibitor-entry__item__body__contacts__additional__website"}):
    print(i['href'])
Can someone help me with this?
Upvotes: 1
Views: 785
Reputation: 195408
The company info is loaded via an Ajax request, so it is not in the HTML that BeautifulSoup parses. However, you can simulate this Ajax request with the requests library.
Example:
import requests
from bs4 import BeautifulSoup

url = "https://forum.iktva.sa/exhibitors-list"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select(".m-exhibitors-list__items__item__header__title__link"):
    company_url = "https://forum.iktva.sa/" + a["href"].split("'")[1]
    soup2 = BeautifulSoup(requests.get(company_url).content, "html.parser")
    print(
        "{:40} {}".format(
            soup2.select_one(".m-exhibitor-entry__item__header__title").text,
            soup2.select_one("h4+a")["href"],
        )
    )
Prints:
Aasia Steel Industrial Group             http://www.aasiasteel.com/
ADES                                     http://investors.adihgroup.com/
AEC                                      https://www.aecl.com
Al Rushaid Group                         https://www.al-rushaid.com/home.html
alfanar                                  https://www.alfanar.com/
AlGihaz Contracting Co.                  https://algihaz.com/
Alkhorayef                               https://www.alkhorayefpetroleum.com/
Alturki Holding                          https://alturkiholding.com/
ArcelorMittal                            https://corporate.arcelormittal.com/
ARO Drilling                             https://www.arodrilling.com/
Baker Hughes                             https://www.bakerhughes.com/
Bin Quraya                               http://www.binquraya.com/
DPS                                      https://www.egyptian-drilling.com/
Global Suhaimi                           http://globalsuhaimi.com/
Halliburton                              https://www.halliburton.com/
Honeywell                                https://www.honeywell.com/us/en
Jana Marine Services                     https://www.jana-ms.com/ar/
Larsen & Toubro Limited                  https://www.larsentoubro.com
McDermott                                https://www.mcdermott.com/
Mitsubishi Power                         https://power.mhi.com/regions/mena
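The `a["href"].split("'")[1]` step in the loop above takes the text between the first pair of single quotes in the link's `href` and appends it to the site root. Here is a minimal sketch of just that string handling. The sample `href` value, including the `openRemoteModal` name and the path, is made up for illustration, and the real attribute on the site may look different; `quoted_path` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urljoin

BASE_URL = "https://forum.iktva.sa/"


def quoted_path(href: str) -> str:
    """Return the text between the first pair of single quotes in href."""
    parts = href.split("'")
    if len(parts) < 3:
        # Fewer than two single quotes: there is nothing to extract.
        raise ValueError(f"no single-quoted path in {href!r}")
    return parts[1]


# Hypothetical href value, invented for illustration only.
sample_href = "javascript:openRemoteModal('exhibitors-list/example-company')"

# urljoin resolves the relative path against the site root.
company_url = urljoin(BASE_URL, quoted_path(sample_href))
print(company_url)
```

Raising on a missing quoted path makes an unexpected `href` fail with a clear message instead of an `IndexError` from `[1]`.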
EDIT: To save to CSV:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://forum.iktva.sa/exhibitors-list"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for a in soup.select(".m-exhibitors-list__items__item__header__title__link"):
    company_url = "https://forum.iktva.sa/" + a["href"].split("'")[1]
    soup2 = BeautifulSoup(requests.get(company_url).content, "html.parser")
    all_data.append(
        [
            soup2.select_one(".m-exhibitor-entry__item__header__title").text,
            soup2.select_one("h4+a")["href"],
        ]
    )
    print(*all_data[-1])

df = pd.DataFrame(all_data, columns=["Name", "URL"])
df.to_csv("data.csv", index=False)  # <-- save to CSV
print(df)
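If pandas is not installed, the same file can also be written with the standard-library `csv` module. This is only a sketch, assuming `all_data` holds `[name, url]` pairs as built in the loop above. Two rows from the printed output are hard-coded here so the snippet runs without network access, and `save_rows` is a hypothetical helper name.

```python
import csv

# Sample rows copied from the printed output; in the real script
# all_data is filled inside the scraping loop.
all_data = [
    ["Aasia Steel Industrial Group", "http://www.aasiasteel.com/"],
    ["ADES", "http://investors.adihgroup.com/"],
]


def save_rows(rows, path="data.csv"):
    """Write a header line followed by the [name, url] rows to path."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "URL"])
        writer.writerows(rows)


save_rows(all_data)
```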
Upvotes: 1