Reputation: 311
Here is my code so far: http://pastebin.com/CdUiXpdf
import requests
from bs4 import BeautifulSoup


def web_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.kupindo.com/Knjige/artikli/1_strana_" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        print("PAGE: " + str(page))
        for link in soup.find_all("a", class_="item_link"):
            href = link.get("href")
            # title = link.string
            print(href)
            # print(title)
            extended_crawler(href)
        page += 1


def extended_crawler(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for view_counter in soup.find_all("span", id="BrojPregleda"):
        print("View Count: ", view_counter.text)


web_crawler(1)
The output is, for example:
PAGE: 1
https://www.kupindo.com/showcontent/2143/Beletristika/37875219_VUK-DRASKOVIC-Izabrana-dela-1-7-Srpska-rec
View Count:
So the view count is empty: even though the extended_crawler function looks for a span with the id BrojPregleda, nothing is displayed.
Upvotes: 0
Views: 41
Reputation: 8382
That's because the span with the ID BrojPregleda is populated via an AJAX call, so it is still empty in the HTML that requests downloads. Either use Selenium to get the value or follow these steps:
1) Get the product ID from the item URL.
2) POST to http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php with a single form-data key, IDPredmet, set to the ID from step 1.
3) Read the view count from the response.
Example:
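def extended_crawler(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")  # not needed for the view count itself
    # The product ID is the part of the last URL segment before the underscore.
    item_id = item_url[item_url.rfind('/') + 1:item_url.rfind('_')]
    # The view count is loaded via AJAX, so request it from that endpoint directly.
    view_count = requests.post('http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php', data={'IDPredmet': item_id})
    print(view_count.text)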
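One caveat about the slice in the example: it cuts at the last underscore in the URL, so a title that itself contains an underscore would leak into the ID. A slightly more defensive sketch is below. The helper name extract_item_id is hypothetical, and it assumes item URLs always end in a segment of the form <ID>_<title>, as in the URL from the question:

```python
from urllib.parse import urlparse


def extract_item_id(item_url):
    """Return the numeric item ID from a URL whose last segment is '<ID>_<title>'.

    Hypothetical helper; assumes the URL shape shown in the question.
    """
    # Look only at the path, so a query string or fragment cannot interfere.
    last_segment = urlparse(item_url).path.rstrip("/").rsplit("/", 1)[-1]
    # Split on the first underscore; the title may contain further underscores.
    item_id = last_segment.split("_", 1)[0]
    if not item_id.isdigit():
        raise ValueError("no numeric item ID found in " + item_url)
    return item_id


url = "https://www.kupindo.com/showcontent/2143/Beletristika/37875219_VUK-DRASKOVIC-Izabrana-dela-1-7-Srpska-rec"
print(extract_item_id(url))  # 37875219
```

The returned string can then be passed as the IDPredmet value in the POST request. Raising ValueError on an unexpected URL is a design choice here, so that a changed URL format fails loudly instead of sending a wrong ID.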
Upvotes: 1