tsetsko

Reputation: 43

Getting empty array using function inside a function

I have been working on my first personal project (written from scratch by myself). The idea is to scrape data from a website (multiple pages). I have built a function that goes through the site's pagination and, from each page, extracts the price, sq.m, publisher, etc. for each property on the page.

Well, this is the idea anyway.

At the moment all the functions from line 55 down work, as does everything above, with the exception that I cannot convert the links I collected in 'urls' to soup (HTML) and save them in 'url_soup'. 'url_soup' prints out as an empty list instead of a list of HTML pages.

My question: can someone help me figure out why I don't see multiple HTML strings in 'url_soup'?

Please don't judge my code. I made it work with quite a lot of effort. I am 100% sure that it can be written much cleaner :)

Code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import re

s = HTMLSession()
url = 'https://www.imoti.net/bg/obiavi/r/prodava/sofia/?page=1&sid=fCfR0b'

# Get all the data from the page
def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    #print(soup)
    return soup

def getnextpage(soup):
    page = soup.find('nav', {'class': 'paginator'})
    if page.find('a', {'class': 'next-page-btn'}):
        url = str(page.find('a', {'class': 'next-page-btn'})['href'])
        return url
    else:
        return

soup = getdata(url)

urls = []
urls.append(url)

while True:
    soup = getdata(url)
    url = getnextpage(soup)
    if not url:
        break
    urls.append(url)
    print(url)

print(urls)

url_soup = []

def urL_soup_generator(links):
    for i in links:
        temporary_link = getdata([i])
        url_soup.append(temporary_link)

print(url_soup)

prices = []
type_of_property = []
sqm_area = []
locations =[]
publisher = []
price_per_m2 = []

def get_sqm(links):
    for i in links:
        for sqm in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            sqm_value = sqm.get_text().split(',')[1].split()[0]
            sqm_area.append(sqm_value)
    return sqm_area

def get_location(links):
    for i in links:
        for location in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            location_value = location.get_text().split(',')[-1].strip()
            locations.append(location_value)
    return locations

def get_type(links):
    for i in links:
        for property_type in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            property_type_value = ' '.join(property_type.get_text().split(',')[0].split()[1:3])
            type_of_property.append(property_type_value)
    return type_of_property

def get_publisher(links):
    for i in links:
        for publish in soup.find('ul', {'class': 'list-view real-estates'}).find_all('span', {'class': 're-offer-type'})[1::2]:
            publish_value = publish.get_text().strip()
            publisher.append(publish_value)
    return publisher

def get_price_per_m2(links):
    for i in links:
        for price_per_m2_ in soup.find('ul', {'class': 'list-view real-estates'}).find_all('ul', {'class': 'parameters'}):
            price_per_m2_value = float(price_per_m2_.get_text().strip().split('/:')[1].strip().replace('EUR', '').strip().replace(' ',''))
            price_per_m2.append(price_per_m2_value)
    return price_per_m2

def total_price(links):
    for i in links:
        for price in soup.find('ul', {'class': 'list-view real-estates'}).find_all('strong', {'class': 'price'}):
            price_text = price.get_text()
            price_arr = re.findall('[0-9]+', price_text)
            final_price = ''
            for each_sub_price in price_arr:
                final_price += each_sub_price
            prices.append(final_price)
    return prices


# print(get_sqm(url_soup))
# print(get_location(url_soup))
# print(get_type(url_soup))
# print(get_publisher(url_soup))
# print(get_price_per_m2(url_soup))
# print(total_price(url_soup))

Upvotes: 0

Views: 47

Answers (2)

Marin

Reputation: 531

It is kind of hard to tell what's wrong without a sample of what is getting printed out, but the problem may be that the variable url_soup is not declared global.

Try adding global url_soup as the first line of the urL_soup_generator function, keeping the url_soup = [] assignment at module level (note that global url_soup = [] on a single line would be a syntax error; the declaration and the assignment have to be separate statements).

Upvotes: 0

CrunchyBox

Reputation: 81

The urL_soup_generator function never actually gets called, meaning the url_soup list is never modified.
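A minimal sketch of the fix, reusing the names from the question's code. Here `getdata` is stubbed out so the example runs without the network (the real one fetches the URL and returns a BeautifulSoup object). Note also that the original loop body calls `getdata([i])`, wrapping the URL in a list; `getdata(i)` is presumably what was intended, since requests expects a URL string:

```python
# Stand-in for the question's getdata(); the real one does
# s.get(url) and returns BeautifulSoup(r.text, 'html.parser').
def getdata(url):
    return f"<soup for {url}>"

url_soup = []

def urL_soup_generator(links):
    for i in links:
        # Pass the URL string itself, not a one-element list [i].
        url_soup.append(getdata(i))

urls = [
    'https://example.com/?page=1',
    'https://example.com/?page=2',
]

urL_soup_generator(urls)  # <-- this call was missing in the original

print(len(url_soup))  # one soup per collected page
```

No `global` declaration is needed here: the function only mutates the existing list via `append`, it never rebinds the name `url_soup`.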

Upvotes: 1
