Piet

Reputation: 11

Webscraping Dynamic Content in Python

I am trying to scrape a specific number from 'https://www.ulb.uni-muenster.de/'. The number is dynamic. When I search for it, I only get the empty element with its class, not the number itself, even though I can see the number clearly when I inspect the page in Chrome. I have tried two approaches:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.ulb.uni-muenster.de/'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
tags = soup.find('span', {'class': 'seatingsCounter'})
print(tags)

Out: <span class="seatingsCounter"></span>

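import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.ulb.uni-muenster.de/')
# Name the parser explicitly so BeautifulSoup does not have to guess one
data = BeautifulSoup(r.content, 'lxml')
examples = []
for d in data.find_all('a'):
    examples.append(d)
# Search the soup built from this response ("data"), not "soup" from the first attempt
my_as = data.find_all("span", {"class": "seatingsCounter"})
print(my_as)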

Out: [<span class="seatingsCounter"></span>]

Neither approach works: the output is always just the empty span with its class.

Upvotes: 1

Views: 71

Answers (1)

FiddleStix

Reputation: 3721

If you look in the page source code, you will see that the number of free places is updated by the JavaScript function showMessage:

var showMessage = function(data) {
                var locations = [ "ZB_LS", "ZB_RS" ];
                var free = 0;
                var total = 0;
                var open = true;
                $('.availableSeatings .spinner').remove();
                $('.availableSeatings .error').data('counter', 0);
                $.each(data.locations, function( key, value ) {
                    if ($.inArray( value.id, locations) !== -1)
                    {
                        free = free + Math.round((100 - value.quota) * value.places/100);
                        total = total + value.places;
                        open = open && value.open;
                    }
                });

                if (open)
                {
                    $('.availableSeatings .message').show().siblings().hide();
                    quota = Math.round(free/total * 100);
                    result = free + '<span class="quota">(' + quota + '%)</span>';
                    date = $.format.date(data.datetime, "dd.MM.yyyy, HH:mm");
                    $('.availableSeatings .seatingsCounter').html(result);  // <- HERE!!
                    $('.availableSeatings .updated .datetime').text(date);
                    $('.availableSeatings .updated').show();
                } else {
                    $('.availableSeatings .closed').show().siblings().hide();
                }
        };
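The arithmetic in showMessage can be ported to Python to make it concrete. This is only a rough sketch, not the site's code: the names `js_round` and `summarize_seatings` and the sample values are made up for illustration, and it assumes `data` has the shape the JavaScript reads (a `locations` list whose entries carry `id`, `quota`, `places` and `open`).

```python
import math


def js_round(x):
    """Round halves up, like JavaScript's Math.round (Python's round() rounds halves to even)."""
    return math.floor(x + 0.5)


def summarize_seatings(data, location_ids=("ZB_LS", "ZB_RS")):
    """Hypothetical helper mirroring the loop in showMessage.

    Returns (free, total, all_open) for the selected location ids.
    """
    free = 0
    total = 0
    all_open = True
    for loc in data["locations"]:
        if loc["id"] in location_ids:
            # Same formula as the JavaScript: quota is treated as percent occupied
            free += js_round((100 - loc["quota"]) * loc["places"] / 100)
            total += loc["places"]
            all_open = all_open and loc["open"]
    return free, total, all_open


# Made-up sample values, only to exercise the function
sample = {
    "locations": [
        {"id": "ZB_LS", "open": True, "quota": 50, "places": 200},
        {"id": "ZB_RS", "open": True, "quota": 75, "places": 100},
        {"id": "OTHER", "open": False, "quota": 10, "places": 50},
    ]
}

free, total, all_open = summarize_seatings(sample)
print(free, total, all_open)
```

Only the two ids listed in `location_ids` contribute, which is why the closed "OTHER" entry does not affect the result.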

A little further down the source code you will see this call:

$.ajax({
            dataType: "json",
            url: "/available-seatings.json",  // <-- THIS LOOKS INTERESTING
            timeout: 40000,
            success: function(data) { showMessage(data); },
            error: function() {
                counter = $('.availableSeatings .error').data('counter');
                if (isNaN(counter) || counter >= 3)
                {
                    showError();
                } else {
                    $('.availableSeatings .error').data('counter', counter + 1);
                }
            },
            complete: function() {
              setTimeout(worker, 60000);
            }
          });

And if we go to https://www.ulb.uni-muenster.de/available-seatings.json then we see something like:

{"datetime":"2019-11-13 13:49:46","locations":[{"id":"ZB_LS","label":"Zentralbibliothek Lesesaal","open":true,"quota":99,"places":678},{"id":"ZB_RS","label":"Zentralbibliothek Recherchesaal","open":true,"quota":94,"places":154},{"id":"VSTH","label":"Bibliothek im Vom-Stein-Haus","open":true,"quota":56,"places":145},{"id":"RWS1","label":"Bibliothek im Rechtswissenschaftlichen Seminar I \/ Einzelarbeitszone","open":true,"quota":98,"places":352},{"id":"RWS1_G","label":"Bibliothek im Rechtswissenschaftlichen Seminar I \/ Gruppenarbeitszone","open":true,"quota":30,"places":40},{"id":"RWS2","label":"Bibliothek im Rechtswissenschaftlichen Seminar II","open":true,"quota":54,"places":162},{"id":"WIWI","label":"Fachbereichsbibliothek Wirtschaftswissenschaften \/ Einzelarbeitszone","open":true,"quota":71,"places":132},{"id":"WIWI_G","label":"Fachbereichsbibliothek Wirtschaftswissenschaften \/ Gruppenarbeitszone","open":true,"quota":98,"places":45},{"id":"ZBSOZ","label":"Zweigbibliothek Sozialwissenschaften","open":true,"quota":74,"places":129},{"id":"FHAUS","label":"Gemeinschaftsbibliothek im F\u00fcrstenberghaus","open":true,"quota":68,"places":197},{"id":"IFE","label":"Bibliothek des Instituts f\u00fcr Erziehungswissenschaft","open":true,"quota":47,"places":183},{"id":"PHI","label":"Bibliotheken im Philosophikum (Domplatz 23)","open":true,"quota":68,"places":98}]}

Voilà. Requesting that URL and parsing the response with Python's json module is probably easier than rewriting your scraper to use Selenium, though that would work too.
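A minimal sketch of the JSON route, using only the standard library, might look like the following. The function names `parse_seatings` and `fetch_seatings` are hypothetical, the 40-second timeout is an arbitrary choice, and the code assumes the endpoint keeps returning the shape shown in the sample response above.

```python
import json
from urllib.request import urlopen

URL = "https://www.ulb.uni-muenster.de/available-seatings.json"


def parse_seatings(raw):
    """Decode the JSON payload and index the locations by their id."""
    data = json.loads(raw)
    return {loc["id"]: loc for loc in data["locations"]}


def fetch_seatings(url=URL, timeout=40):
    """Download the JSON document and return its locations keyed by id."""
    with urlopen(url, timeout=timeout) as response:
        return parse_seatings(response.read())


# Offline demo on an excerpt of the sample response shown above
sample_raw = (
    '{"datetime":"2019-11-13 13:49:46","locations":['
    '{"id":"ZB_LS","label":"Zentralbibliothek Lesesaal",'
    '"open":true,"quota":99,"places":678}]}'
)
reading_room = parse_seatings(sample_raw)["ZB_LS"]
print(reading_room["label"], reading_room["quota"], reading_room["places"])
```

Calling `fetch_seatings()` performs the live request. The free-seat count then follows from the same formula showMessage uses, `(100 - quota) * places / 100`, summed over "ZB_LS" and "ZB_RS".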

Upvotes: 1
