Reputation: 31
I'm trying to scrape the href of the first link titled "BACC B ET A COMPTABILITE CONSEIL". However, I can't seem to extract the href when I'm using BeautifulSoup. Could you please recommend a solution?
Here's the URL: https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160
My code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.content, 'html.parser')
# Look for the element that wraps the company name in the search results
a = soup.find('div', {'class': 'nom-entreprise'})
print(a)
Result:
None.
Upvotes: 1
Views: 168
Reputation: 19998
The page content is loaded dynamically with JavaScript, so requests alone never receives it.
We can use Selenium as an alternative to scrape the page.
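To illustrate why the find() call in the question returns None: the HTML that requests downloads is only the page shell, and the result elements are added later by JavaScript in the browser. Here is a minimal sketch of that difference. Both HTML strings are made-up stand-ins (not copied from pappers.fr), ClassFinder and has_class are hypothetical helper names, and the standard-library parser is swapped in for BeautifulSoup so the snippet runs without extra installs.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag carrying a given CSS class appears in the HTML."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value and self.wanted_class in value.split():
                self.found = True


def has_class(html, wanted_class):
    """Return True if some tag in `html` has `wanted_class` among its classes."""
    finder = ClassFinder(wanted_class)
    finder.feed(html)
    return finder.found


# Hypothetical stand-in for what requests downloads: an empty shell plus a script.
static_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

# Hypothetical stand-in for the DOM after the browser has run the JavaScript.
rendered_html = (
    '<html><body><div id="app">'
    '<div class="nom-entreprise"><a class="gros-gros-nom" href="/entreprise/x">X</a></div>'
    '</div></body></html>'
)

print(has_class(static_html, "nom-entreprise"))    # False
print(has_class(rendered_html, "nom-entreprise"))  # True
```

A parser can only find what is in the text it is given, which is why a real browser (or the underlying data request) is needed here.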
Install it with: pip install selenium
Download the correct ChromeDriver from here.
To find the links you can use a CSS selector: a.gros-gros-nom
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
url = "https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160"
driver = webdriver.Chrome()
driver.get(url)
# Wait for the link to be visible on the page and save element to a variable `link`
link = WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "a.gros-gros-nom"))
)
print(link.get_attribute("href"))
driver.quit()
Output:
https://www.pappers.fr/entreprise/bacc-b-et-a-comptabilite-conseil-378002208
Upvotes: 1
Reputation: 195408
The link is constructed dynamically with JavaScript. All you need is a number, which can be obtained with the same Ajax query the page makes:
import json
import requests
# url = "https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160"
api_url = "https://api.pappers.fr/v2/recherche"
payload = {
    "q": "B & A COMPTABILITE CONSEIL",  # <-- your search query
    "code_naf": "",
    "code_postal": "94160",  # <-- this is "ville" from URL
    "api_token": "97a405f1664a83329a7d89ebf51dc227b90633c4ba4a2575",
    "precision": "standard",
    "bases": "entreprises,dirigeants,beneficiaires,documents,publications",
    "page": "1",
    "par_page": "20",
}
data = requests.get(api_url, params=payload).json()
# uncomment this to print all data (all details):
# print(json.dumps(data, indent=4))
print("https://www.pappers.fr/entreprise/" + data["resultats"][0]["siren"])
Prints:
https://www.pappers.fr/entreprise/378002208
Opening the link automatically redirects to:
https://www.pappers.fr/entreprise/bacc-b-et-a-comptabilite-conseil-378002208
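If you start from the search URL rather than typing the query by hand, the two values that change between searches ("q" and "code_postal") can be read out of its query string, and the final link can be built defensively in case the search returns nothing. This is only a sketch: search_params_from_url and first_company_url are hypothetical helper names, and the sample dict is hand-written to mimic the "resultats"/"siren" shape used above, not real API output.

```python
from urllib.parse import urlparse, parse_qs


def search_params_from_url(search_url):
    """Extract the query text and postal code from a pappers.fr search URL."""
    query = parse_qs(urlparse(search_url).query)
    return {
        "q": query.get("q", [""])[0],
        # the site calls this parameter "ville"; the API payload calls it "code_postal"
        "code_postal": query.get("ville", [""])[0],
    }


def first_company_url(data):
    """Return the company page URL for the first result, or None if there is none."""
    results = data.get("resultats") or []
    if not results:
        return None
    return "https://www.pappers.fr/entreprise/" + str(results[0]["siren"])


url = "https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160"
print(search_params_from_url(url))
# {'q': 'B & A COMPTABILITE CONSEIL', 'code_postal': '94160'}

# Hand-written sample shaped like the response used above (not real API output).
sample = {"resultats": [{"siren": "378002208"}]}
print(first_company_url(sample))
# https://www.pappers.fr/entreprise/378002208
print(first_company_url({"resultats": []}))
# None
```

The returned dict can be merged into the payload with payload.update(...) before calling requests.get, and checking for None avoids an IndexError when a search has no matches.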
Upvotes: 2