Reputation: 81
I'm trying to scrape house prices from this link: https://www.leboncoin.fr/ventes_immobilieres/offres/ile_de_france/p-2/
I need to know what is wrong with my program.
My program:
import csv
import requests
from bs4 import BeautifulSoup
with open("bc.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prix", "code_postal", "description", "nombre_pieces", "surface"])
    for i in range(1, 20):
        url = "https://www.leboncoin.fr/ventes_immobilieres/offres/ile_de_france/p-%s/" % i
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        repo = soup.find(class_="undefined")
        for repo in repo.find_all("li", attrs={"itemscope itemtype": "http://schema.org/Offer"}):
            prix = repo.find("span", {"itemprop": "priceCurrency"})
            prix = prix.text if prix else ""
            writer.writerow([prix])
I get this error:
Traceback (most recent call last):
  File "nv.py", line 14, in <module>
    for repo in repo.find_all("li", attrs={"itemscope itemtype": "http://schema.org/Offer"}):
AttributeError: 'NoneType' object has no attribute 'find_all'
Upvotes: 1
Views: 4777
Reputation: 84465
The blocking and the use of Selenium have already been covered. I will show a way to get all the listings in JSON form, from which you can easily extract information. If you use Selenium to load each page, you can use a regex to pull the listings data out of the page source and pass it to json.loads. The result is a Python object that is easy to parse for the details of each listing.
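import json
import re

from bs4 import BeautifulSoup as bs
from selenium import webdriver

# Capture the JSON object that starts with {"req on one line of the page source.
p = re.compile(r'({"req.*).*[^\r\n]')
driver = webdriver.Chrome()
driver.get("https://www.leboncoin.fr/ventes_immobilieres/offres/ile_de_france/p-3/")
# Parsed page; the regex extraction below works on the raw source instead.
soup = bs(driver.page_source, 'html.parser')
data = json.loads(p.findall(driver.page_source)[0])
listings = data['data']['ads']
for listing in listings:
    print(listing)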
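Regex explanation: the capture group matches from {"req onward and stops just short of the end of that line (. does not match a newline). findall returns the captured text, which is then passed to json.loads.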
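To see what the pattern captures without launching a browser, here is a minimal offline sketch. The page_source string is a made-up stand-in for the real page: the name STATE and the subject and price keys are assumptions for illustration, and only the data['data']['ads'] path comes from the code above. It also shows json.JSONDecoder.raw_decode as an alternative that does not depend on how the line ends; that is a swapped-in technique, not part of the original approach.

```python
import json
import re

# Made-up stand-in for driver.page_source; the real page layout may differ.
page_source = (
    "<script>\n"
    'window.STATE = {"req": {"page": 3}, "data": {"ads": '
    '[{"subject": "Maison", "price": [250000]}]}};\n'
    "</script>\n"
)

# Pattern from the answer: capture from '{"req' and stop just short of the
# end of that line, which in this sample drops the trailing ';'.
pattern = re.compile(r'({"req.*).*[^\r\n]')


def extract_with_regex(source):
    """Return the embedded JSON object using the answer's regex."""
    matches = pattern.findall(source)
    if not matches:
        raise ValueError("no embedded JSON found (blocked page?)")
    return json.loads(matches[0])


def extract_with_raw_decode(source, marker='{"req'):
    """Alternative: let the JSON decoder work out where the object ends."""
    start = source.find(marker)
    if start == -1:
        raise ValueError("no embedded JSON found (blocked page?)")
    obj, _end = json.JSONDecoder().raw_decode(source, start)
    return obj


data = extract_with_regex(page_source)
for listing in data["data"]["ads"]:
    print(listing["subject"], listing["price"])
```

The regex only yields valid JSON when exactly the right amount of text trails the object on its line, so the raw_decode variant may be the more tolerant choice if the page markup changes.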
Upvotes: 3
Reputation: 2887
You are trying to search for something that doesn't exist in the data returned by requests.
If you check requests.get(url).text, you'll probably see something similar to:
<!--
Need permission to access data? Contact: [email protected]
-->
<html><head><title>You have been blocked</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var dd={'cid':'AHrlqAAAAAMAptz12-9nkWQAJcs_Yg==','hsh':'05B30BD9055986BD2EE8F5A199D973','t':'fe'}</script><script src="https://ct.datadome.co/c.js"></script></body></html>
This results in None being assigned to the variable repo, so the interpreter complains that an object of type None has no find_all() attribute.
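The failure can be reproduced without any network access. In this minimal sketch, None stands in for the result of soup.find(class_="undefined") on the blocked page, and find_all_or_empty is a hypothetical helper (not part of Beautiful Soup) that shows one way to guard the call.

```python
def find_all_or_empty(container, *args, **kwargs):
    """Call container.find_all(...) unless container is None (hypothetical helper)."""
    if container is None:
        return []
    return container.find_all(*args, **kwargs)


# soup.find(...) returns None when nothing matches, as on the blocked page.
repo = None

try:
    repo.find_all("li")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# The guarded version yields no offers instead of raising.
offers = find_all_or_empty(repo, "li")
print(offers)
```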
So basically you need to make sure that you have the correct data before you start processing it. You can get the data without being blocked by using Selenium and ChromeDriver, as suggested by KunduK in his answer. You can get ChromeDriver from http://chromedriver.chromium.org/
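One way to make sure you have the correct data is to test the response for the block page before parsing it. The sketch below rests on assumptions: looks_blocked is a hypothetical helper, and its two markers are taken from the blocked response quoted above, which the site could change at any time.

```python
# Markers copied from the blocked response quoted in this answer.
BLOCK_MARKERS = ("you have been blocked", "datadome")


def looks_blocked(html):
    """Return True if the HTML looks like the anti-bot block page (hypothetical helper)."""
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


blocked_html = (
    "<html><head><title>You have been blocked</title></head>"
    '<body><p id="cmsg">Please enable JS and disable any ad blocker</p>'
    '<script src="https://ct.datadome.co/c.js"></script></body></html>'
)
# Made-up listing page, for contrast only.
listing_html = '<html><body><ul class="undefined"><li>offer</li></ul></body></html>'

print(looks_blocked(blocked_html))
print(looks_blocked(listing_html))
```

In the question's loop, this check would run on requests.get(url).text before the call to BeautifulSoup, so a blocked page is reported right away instead of crashing later on find_all.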
Upvotes: 3
Reputation: 33384
Here the page is rendered by JavaScript. You can use Selenium together with Beautiful Soup to get the desired output.
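from bs4 import BeautifulSoup
from selenium import webdriver

# Placeholder: replace with the path to your ChromeDriver executable.
# Newer Selenium releases expect the path via a Service object instead.
driver = webdriver.Chrome('path of the chrome driver')
driver.get("https://www.leboncoin.fr/ventes_immobilieres/offres/ile_de_france/p-2/")
soup = BeautifulSoup(driver.page_source, 'html.parser')
repo = soup.find(class_="undefined")
# Each listing is an <li> marked up as a schema.org Offer.
for offer in repo.find_all("li", attrs={"itemtype": "http://schema.org/Offer"}):
    prix = offer.find("span", {"itemprop": "priceCurrency"})
    # Skip listings with no price element or an empty price.
    if prix is not None and prix.text != '':
        print(prix.text)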
Upvotes: 2