Reputation: 81
I am trying to scrape data from https://www.seloger.com/ and I get the error below. I don't understand what's wrong, because I ran this code before and it worked.
import re
import requests
import csv
import json
with open("selog.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "Type", "Prix", "Code_postal", "Ville", "Departement", "Nombre_pieces", "Nbr_chambres", "Type_cuisine", "Surface"])
    for i in range(1, 500):
        url = str('https://www.seloger.com/list.htm?tri=initial&idtypebien=1,2&pxMax=3000000&div=2238&idtt=2,5&naturebien=1,2,4&LISTING-LISTpg=' + str(i))
        r = requests.get(url, headers = {'User-Agent' : 'Mozilla/5.0'})
        p = re.compile('var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)
        x = p.findall(r.text)[0].strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\')
        x = re.sub(r'\s{2,}|\\r\\n', '', x)
        data = json.loads(x)
        f = csv.writer(open("Seloger.csv", "wb+"))
        for product in data['products']:
            ID = product['idannonce']
            prix = product['prix']
            surface = product['surface']
            code_postal = product['codepostal']
            nombre_pieces = product['nb_pieces']
            nbr_chambres = product['nb_chambres']
            Type = product['typedebien']
            type_cuisine = product['idtypecuisine']
            ville = product['ville']
            departement = product['departement']
            etage = product['etage']
            writer.writerow([ID, Type, prix, code_postal, ville, departement, nombre_pieces, nbr_chambres, type_cuisine, surface])
This is the error:
Traceback (most recent call last):
  File "Seloger.py", line 20, in <module>
    x = p.findall(r.text)[0].strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\')
IndexError: list index out of range
Upvotes: 0
Views: 1860
Reputation: 626691
The error occurs because sometimes there is no match, and you are trying to access a non-existent item in an empty list. The same error can be reproduced with print(re.findall("s", "d")[0]).
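As a small illustration of the point above, the following sketch shows the empty list that findall returns when nothing matches, the IndexError that indexing it raises, and the None that re.search returns in the same situation. The variable names are only for illustration.

```python
import re

# findall returns a list of all matches; with no match the list is empty,
# so indexing it with [0] raises IndexError.
matches = re.findall("s", "d")
try:
    first = matches[0]
except IndexError as exc:
    error_message = str(exc)

# re.search returns None instead of raising when nothing matches,
# which makes the "no match" case easy to test for.
match = re.search("s", "d")

print(matches, error_message, match)
```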
To fix the issue, replace the x = p.findall(r.text)[0].strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\') line with:
x = ''
xm = p.search(r.text)
if xm:
    x = xm.group(1).strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\')
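For context, here is a minimal sketch of how that guard might sit in the scraping loop. One thing to keep in mind is that an empty string is not valid JSON, so json.loads('') raises an error; this sketch therefore skips a page when there is no match rather than passing an empty x on. The helper name extract_ava_data and the two sample page strings are hypothetical and stand in for r.text; the pattern and the cleanup steps are the ones from the question.

```python
import json
import re

# Same pattern as in the question, written as a raw string.
AVA_DATA = re.compile(r'var ava_data =(.*);\r\n\s+ava_data\.logged = logged;', re.DOTALL)


def extract_ava_data(html):
    """Hypothetical helper: return the parsed ava_data dict, or None if absent."""
    xm = AVA_DATA.search(html)
    if xm is None:
        return None  # no match on this page, nothing to parse
    x = xm.group(1).strip().replace('\r\n ', '').replace('\xa0', ' ').replace('\\', '\\\\')
    x = re.sub(r'\s{2,}|\\r\\n', '', x)
    return json.loads(x)


# Made-up page contents standing in for r.text (assumptions, not real responses).
page_ok = 'var ava_data = {"products": [{"idannonce": "1", "prix": "1000"}]};\r\n    ava_data.logged = logged;'
page_without_data = '<html><body>Access denied</body></html>'

ids = []
for html in (page_ok, page_without_data):
    data = extract_ava_data(html)
    if data is None:
        continue  # skip pages that do not contain the ava_data block
    for product in data['products']:
        ids.append(product['idannonce'])

print(ids)
```

Skipping the page is one possible choice; logging the URL or stopping the loop once pages stop matching would work just as well, depending on what the site returns.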
NOTES
Since you used p.findall(r.text)[0], you want to get the first match in the input, so re.search is best here as it only returns the first match.
The Group 1 value is then accessed with matchObject.group(1).
if xm: is important: if there is no match, x will remain an empty string; otherwise, it will be assigned the modified value in Group 1.
Upvotes: 0
Reputation:
This line is wrong:
x = p.findall(r.text)[0].strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\')
What do you need to find in the text?
To work on the scraped text, change the line above to:
x = r.text.strip().replace('\r\n ','').replace('\xa0',' ').replace('\\','\\\\')
and then search x for what you need.
Upvotes: 1