Reputation: 73
Can a website block a Python script from scraping values from it (via BeautifulSoup)?
I use this script:
import gspread
import requests
from bs4 import BeautifulSoup
URL = 'https://www.sreality.cz/hledani/prodej/byty/praha?velikost=1%2Bkk'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0'}
# Send a single request that actually uses the browser-like User-Agent header
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
# Scraping the eurobydleni.cz website
results = soup.find_all('div', attrs={'class':'text-wrap'})
for job in results:
    nemovitost = job.find('span', attrs={'class':'name ng-binding'})
    nemovitost_final = nemovitost.text.strip()
    print(nemovitost_final)
But there is no output. The script starts and then ends quickly.
I need to print the text inside <span class="name ng-binding">Prodej bytu 1+kk 33 m²</span>
So the expected output is 'Prodej bytu 1+kk', 'Prodej bytu 1+kk', and so on.
Edit: Using help from @Andrej Kesely:
I tried your code (inside my own code, which inserts the values into a Google Sheet), but I got an error.
import gspread
import requests
import datetime
import json
from bs4 import BeautifulSoup
from oauth2client.service_account import ServiceAccountCredentials
from pprint import pprint
from datetime import timedelta
import time
datetime.datetime.now()
scope = [
    'https://www.googleapis.com/auth/spreadsheets',
    'https://www.googleapis.com/auth/drive'
]
api_url = 'https://www.sreality.cz/api/cs/v2/estates?category_main_cb=1&category_sub_cb=2&category_type_cb=1&locality_region_id=10&per_page=20'
data = requests.get(api_url).json()
# communication with the spreadsheet (Google Sheet)
data = ServiceAccountCredentials.from_json_keyfile_name("data.json", scope)
client = gspread.authorize(data)
sheet = client.open("skript").worksheet('sreality.cz')
data = sheet.get_all_records()
# write to the LOG sheet
sheet2 = client.open("skript").worksheet('LOG')
data = sheet2.get_all_records()
insertRow = ["sreality.cz", "START: " + str(datetime.datetime.now().strftime('%d-%m-%Y ve %H:%M:%S'))]
sheet2.insert_row(insertRow,2)
for estate in data["_embedded"]["estates"]:
    insertRow = ["{:<30} {:<30} {} {}".format(estate["name"], estate["price"], estate["locality"])]
    sheet.insert_row(insertRow,2)
insertRow = ["sreality.cz", "KONEC: " + str(datetime.datetime.now().strftime('%d-%m-%Y ve %H:%M:%S'))]
sheet2.insert_row(insertRow,2)
time.sleep(60)
Error:
Traceback (most recent call last):
File "c:/Skola-Projekty/python/byt/sreality.cz.py", line 34, in <module>
for estate in data["_embedded"]["estates"]:
TypeError: list indices must be integers or slices, not str
PS C:\Skola-Projekty\python\byt>
Edit 2: Using help from @Andrej Kesely:
I used the code, but it does not split each line into columns. It puts all the data for one flat into a single cell and then moves on to the next row. I need the values split into 3 columns. Is there a way to do that with your code, please?
Current output in the Google Sheet (everything in one cell per row):
Flat: Price Address
Prodej bytu 1+kk 23 m² 2827000 Římská, Praha 2 - Vinohrady
Prodej bytu 1+kk 27 m² 4049000 Ječná, Praha 2 - Nové Město
Prodej bytu 1+kk 33 m² 6005000 Záhřebská, Praha 2 - Vinohrady
I need:
Flat: | Price: | Address:
Prodej bytu 1+kk 23 m² | 2827000 | Římská, Praha 2 - Vinohrady
Prodej bytu 1+kk 27 m² | 4049000 | Ječná, Praha 2 - Nové Město
Prodej bytu 1+kk 33 m² | 6005000 | Záhřebská, Praha 2 - Vinohrady
Upvotes: 1
Views: 168
Reputation: 195408
The data is loaded via Ajax from an external URL, so it is not part of the HTML that requests downloads and BeautifulSoup finds nothing to match. Here is an example of how to load the data directly:
import json
import requests
api_url = "https://www.sreality.cz/api/cs/v2/estates?category_main_cb=1&category_sub_cb=2&category_type_cb=1&locality_region_id=10&per_page=20"
data = requests.get(api_url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for estate in data["_embedded"]["estates"]:
    print("{:<30} {}".format(estate["name"], estate["price"]))
Prints:
Prodej bytu 1+kk 33 m²         4809347
Prodej bytu 1+kk 32 m²         5493000
Prodej bytu 1+kk 44 m²         6167000
Prodej bytu 1+kk 23 m²         2896000
Prodej bytu 1+kk 26 m²         3320000
Prodej bytu 1+kk 20 m²         2715000
Prodej bytu 1+kk 36 m²         3600000
Prodej bytu 1+kk 44 m²         4770000
Prodej bytu 1+kk 18 m²         3850000
Prodej bytu 1+kk 33 m²         5226000
Prodej bytu 1+kk 15 m²         2950000
Prodej bytu 1+kk 15 m²         2950000
Prodej bytu 1+kk 15 m²         2950000
Prodej bytu 1+kk 36 m²         5248000
Prodej bytu 1+kk 22 m²         3990000
Prodej bytu 1+kk 80 m²         6300000
Prodej bytu 1+kk 46 m²         6394000
Prodej bytu 1+kk 33 m²         3469000
Prodej bytu 1+kk 39 m²         5099000
Prodej bytu 1+kk 32 m²         4250000
Prodej bytu 1+kk 30 m²         4759000
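Regarding the TypeError in the question's edit: the name data is reassigned several times there (API response, credentials, then the result of get_all_records()), so by the time the loop runs it most likely holds a list of sheet records rather than the API response. A minimal sketch of one way to avoid that, assuming each value gets its own name. The helper extract_rows and the names api_data and log_records are hypothetical, and the sample payload only mirrors the fields used above with values copied from the question:

```python
# Sample payload shaped like the fields used above (not a real API response).
sample_payload = {
    "_embedded": {
        "estates": [
            {"name": "Prodej bytu 1+kk 23 m²", "price": 2827000,
             "locality": "Římská, Praha 2 - Vinohrady"},
            {"name": "Prodej bytu 1+kk 27 m²", "price": 4049000,
             "locality": "Ječná, Praha 2 - Nové Město"},
        ]
    }
}


def extract_rows(api_data):
    """Return one [name, price, locality] list per estate in the response."""
    return [
        [estate["name"], estate["price"], estate["locality"]]
        for estate in api_data["_embedded"]["estates"]
    ]


# Keep the API response under its own name ...
api_data = sample_payload
# ... and use a different name for whatever is read back from a sheet,
# e.g. log_records = sheet2.get_all_records(), so api_data is not overwritten.
log_records = [{"source": "sreality.cz"}]  # stand-in for get_all_records()

rows = extract_rows(api_data)
for row in rows:
    print(row)

# Indexing a list of records with a string reproduces the error from the question.
try:
    log_records["_embedded"]
except TypeError as exc:
    print("TypeError:", exc)
```

In the real script, api_data would come from requests.get(api_url).json(), and the credentials and sheet records would each keep a separate name as well.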
Upvotes: 1
Reputation: 73
In Edit 2 I asked for help splitting the values into columns, so here is my final solution:
insertRow = ['sreality.cz', "{:<30}".format(estate["name"]), "{:<30}".format(estate["locality"]), "{:<30}".format(estate["price"]), str(pocet_bytu)]
sheet.insert_row(insertRow,2)
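This works because each element of the list passed to insert_row ends up in its own cell, whereas the earlier version passed a list with a single formatted string. A minimal sketch of just the row-building step, assuming the same estate fields. build_row is a hypothetical helper, the sample estate uses values from the question, the "{:<30}" padding is left out because it is not needed for the column split, and pocet_bytu is omitted because its definition is not shown here:

```python
def build_row(source, estate):
    """Build one sheet row; each list element becomes a separate cell."""
    return [source, estate["name"], estate["locality"], str(estate["price"])]


# Sample estate with values copied from the output shown in the question.
estate = {
    "name": "Prodej bytu 1+kk 23 m²",
    "price": 2827000,
    "locality": "Římská, Praha 2 - Vinohrady",
}

row = build_row("sreality.cz", estate)
print(row)

# With a gspread worksheet, the row would then be written with:
# sheet.insert_row(row, 2)
```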
Thanks to @Andrej Kesely for the help!
Upvotes: 1