Reputation: 43
I have been working on my first personal project (written from scratch). The idea is to scrape data from a website (multiple pages). I have built functions to go through the pagination of the website and, from each page, extract the price, sq.m, publisher, etc. for each property on the page.
Well, this is the idea anyway.
At the moment all of the scraping functions further down work, as does everything above them, with the exception that I cannot convert the links I collect in 'urls' to soup (HTML) and save them in 'url_soup'. 'url_soup' prints out empty rather than as a list of parsed HTML pages.
My question: can someone help me figure out why I don't see multiple HTML documents in 'url_soup'?
Please don't judge my code. I have put quite a lot of effort into making it work, and I am 100% sure it can be written much more cleanly :)
Code:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import re
s = HTMLSession()
url = 'https://www.imoti.net/bg/obiavi/r/prodava/sofia/?page=1&sid=fCfR0b'
# Get all the data from the page
def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # print(soup)
    return soup

def getnextpage(soup):
    page = soup.find('nav', {'class': 'paginator'})
    if page.find('a', {'class': 'next-page-btn'}):
        url = str(page.find('a', {'class': 'next-page-btn'})['href'])
        return url
    else:
        return
soup = getdata(url)
urls = []
urls.append(url)
while True:
    soup = getdata(url)
    url = getnextpage(soup)
    if not url:
        break
    urls.append(url)
    print(url)
print(urls)
url_soup = []
def urL_soup_generator(links):
    for i in links:
        temporary_link = getdata([i])
        url_soup.append(temporary_link)

print(url_soup)
prices = []
type_of_property = []
sqm_area = []
locations =[]
publisher = []
price_per_m2 = []
def get_sqm(links):
    for i in links:
        for sqm in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            sqm_value = sqm.get_text().split(',')[1].split()[0]
            sqm_area.append(sqm_value)
    return sqm_area

def get_location(links):
    for i in links:
        for location in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            location_value = location.get_text().split(',')[-1].strip()
            locations.append(location_value)
    return locations

def get_type(links):
    for i in links:
        for property_type in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            property_type_value = ' '.join(property_type.get_text().split(',')[0].split()[1:3])
            type_of_property.append(property_type_value)
    return type_of_property

def get_publisher(links):
    for i in links:
        for publish in soup.find('ul', {'class': 'list-view real-estates'}).find_all('span', {'class': 're-offer-type'})[1::2]:
            publish_value = publish.get_text().strip()
            publisher.append(publish_value)
    return publisher

def get_price_per_m2(links):
    for i in links:
        for price_per_m2_ in soup.find('ul', {'class': 'list-view real-estates'}).find_all('ul', {'class': 'parameters'}):
            price_per_m2_value = float(price_per_m2_.get_text().strip().split('/:')[1].strip().replace('EUR', '').strip().replace(' ', ''))
            price_per_m2.append(price_per_m2_value)
    return price_per_m2

def total_price(links):
    for i in links:
        for price in soup.find('ul', {'class': 'list-view real-estates'}).find_all('strong', {'class': 'price'}):
            price_text = price.get_text()
            price_arr = re.findall('[0-9]+', price_text)
            final_price = ''
            for each_sub_price in price_arr:
                final_price += each_sub_price
            prices.append(final_price)
    return prices
# print(get_sqm(url_soup))
# print(get_location(url_soup))
# print(get_type(url_soup))
# print(get_publisher(url_soup))
# print(get_price_per_m2(url_soup))
# print(total_price(url_soup))
Upvotes: 0
Views: 47
Reputation: 531
It is hard to tell what's wrong without a sample of what is getting printed out, but the problem may be that url_soup is being treated as a local variable inside the function. Try adding global url_soup as the first line of the urL_soup_generator function. (Note that global url_soup = [] is not valid syntax: the global declaration and any assignment must be separate statements, and the declaration belongs inside the function.)
Upvotes: 0
Reputation: 81
The urL_soup_generator function never actually gets called, so the url_soup list is never modified.
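A minimal sketch of the fix under that diagnosis: actually call the function, and pass each URL string to getdata rather than the one-element list [i] that the original getdata([i]) builds. getdata is stubbed here so the example runs without any network access; in the real code it would do the requests + BeautifulSoup fetch.

```python
# Standalone sketch: url_soup only fills up if the function is called.
def getdata(url):
    # Stub standing in for the question's requests + BeautifulSoup fetch.
    return f'<soup for {url}>'

url_soup = []

def urL_soup_generator(links):
    for i in links:
        temporary_link = getdata(i)  # pass the string itself, not [i]
        url_soup.append(temporary_link)

urls = ['https://example.com/?page=1', 'https://example.com/?page=2']
urL_soup_generator(urls)  # this call was missing in the question
print(len(url_soup))  # 2
```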
Upvotes: 1