Abundnce10

Reputation: 2210

Scraping content from AJAX onclick pop-up

I'm attempting to scrape information from this page using Python: https://j2c-com.com/Euronaval14/catalogueWeb/catalogue.php?lang=gb. I'm specifically interested in the pop-up that appears when you click on an individual exhibitor's name. The challenging part is that the page relies heavily on JavaScript and loads the data through AJAX calls.

I've examined the network calls when clicking on an exhibitor and it appears that the AJAX call goes to this URL (for the first exhibitor in the list, "A.I.A.D. and MOD ITALY"): https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=D000365D000365&rnd=0.005115277832373977

I understand where the cle parameter comes from (the id of the <span> tag), but I don't understand how the rnd parameter is derived. Is it simply a random number? I tried supplying a random number with each request, but the HTML returned is missing the actual content of the pop-up.

This leads me to believe that either the rnd parameter isn't a random number, or I need some type of cookie to be present for the actual data to come through in the response.

Here's my code so far. I'm using Requests to fetch the pages and BeautifulSoup to parse the HTML:

import random
import decimal
import requests
from bs4 import BeautifulSoup

#base_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/catalogue.php?lang=gb'
base_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/cataloguerecherche.php?listeFavoris=&typeRecherche=1&typeRechSociete=&typeSociete=&typeMarque=&typeDescriptif=&typeActivite=&choixSociete=&choixPays=&choixActivite=&choixAgent=&choixPavillon=&choixZoneExpo=&langue=gb&rnd=0.1410133063327521'


def generate_random_number(i,d):
    "Produce a random number between 0 and 1, with up to 16 decimal digits"
    return str(decimal.Decimal('%d.%d' % (random.randint(0,i),random.randint(0,d))))



r = requests.get(base_url)
soup = BeautifulSoup(r.text)

table = soup.find('table', {'id':'tableResultat'})

trs = table.findAll('tr')


for tr in trs:
    span = tr.find('span')
    cle = span.get('id')

    url = 'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=' + cle + '&rnd=' + generate_random_number(0,9999999999999999)
    pop = requests.post(url)

    print(url)
    print(pop.text)

    break

Can you help me understand how I can successfully capture the pop-up data, or what I'm doing wrong? Thanks in advance!

Upvotes: 1

Views: 1585

Answers (1)

alecxe

Reputation: 473833

The rnd parameter is not the problem. It is just a random number generated by JavaScript's Math.random() function.
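As a minimal sketch of that point, a value of the same shape can be produced in Python with random.random(), which returns a float in the same [0, 1) range. The helper name make_rnd is hypothetical, and this assumes the server never validates the value:

```python
import random


def make_rnd():
    """Return a string such as '0.1625497943120751' (hypothetical helper)."""
    # random.random() yields a float in [0.0, 1.0), like Math.random() in JavaScript.
    return repr(random.random())


rnd = make_rnd()
print(rnd)
```

Unlike formatting two separate integers as '%d.%d', this keeps any leading zeros in the fractional part, so the values are spread evenly over the interval.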

As you suspected, it is about cookies. The PHPSESSID cookie has to be sent with every subsequent request. Just start a requests.Session() and use it for every request you make:

The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance.

...

# start session
session = requests.Session()

r = session.get(base_url)
soup = BeautifulSoup(r.text)

table = soup.find('table', {'id':'tableResultat'})
trs = table.findAll('tr')

for tr in trs:
    span = tr.find('span')
    cle = span.get('id')

    url = 'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=' + cle + '&rnd=' + generate_random_number(0,9999999999999999)
    pop = session.post(url)  # <-- the POST request here contains cookies returned by the first GET call

    print(url)
    print(pop.text)

    break

It prints the following (note that the HTML is now filled with the required data):

https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=D000365D000365&rnd=0.1625497943120751
<table class='divAdresse'>
    <tr>
        <td class='ficheAdresse' valign='top'>Via Nazionale 54<br>IT-00184 - Roma<br><img
                src='../../intranetJ2C/images/flags/IT.gif' style='margin-right:5px;'>ITALY<br><br>Phone: +39 06 488
            0247 | Fax: +39 06 482 74 76<br><br>Website: <a href='http://www.aiad.it' target='_new'>www.aiad.it</a></td>
    </tr>
</table>
<br>
<b class="divMarque">Contact:</b><br>
<font class="ficheAdresse"> Carlo Festucci - Secretary General<br>
<a href="mailto:[email protected]">[email protected]</a></font>
<br><br>
<div id='divTexte' class='ficheTexte'></div>
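To see the effect of the session without touching the site, here is a small offline sketch. It sets a made-up PHPSESSID value by hand (standing in for the cookie the server would return on the first GET) and compares a request prepared through the session with one prepared on its own. Nothing is sent over the network, and the cookie value 'abc123' is purely an assumption for illustration:

```python
import requests

session = requests.Session()
# Stand-in for the Set-Cookie header returned by the first GET (hypothetical value).
session.cookies.set('PHPSESSID', 'abc123')

url = 'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php'

# Build both requests without sending them.
with_session = session.prepare_request(requests.Request('POST', url))
without_session = requests.Request('POST', url).prepare()

print(with_session.headers.get('Cookie'))     # cookie merged in from the session
print(without_session.headers.get('Cookie'))  # no session, so no Cookie header
```

The request built through the session carries the cookie, while a bare requests.post() call starts from an empty cookie jar each time, which would explain the empty pop-up content.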

Update:

The reason you were not getting results for the other exhibitors in the table is difficult to explain, but the main point is to simulate the whole sequence of AJAX requests that the browser makes under the hood when you click on a row:

import random
import decimal
import requests
from bs4 import BeautifulSoup

base_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/cataloguerecherche.php?listeFavoris=&typeRecherche=1&typeRechSociete=&typeSociete=&typeMarque=&typeDescriptif=&typeActivite=&choixSociete=&choixPays=&choixActivite=&choixAgent=&choixPavillon=&choixZoneExpo=&langue=gb&rnd=0.1410133063327521'
fiche_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/fiche.php'
reload_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/reload.php'
data_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php'


def generate_random_number(i,d):
    "Produce a random number between 0 and 1, with up to 16 decimal digits"
    return str(decimal.Decimal('%d.%d' % (random.randint(0, i),random.randint(0, d))))


# start session
session = requests.Session()

r = session.get(base_url)
soup = BeautifulSoup(r.content)
for span in soup.select('table#tableResultat tr span'):
    cle = span.get('id')

    session.post(reload_url)
    session.post(fiche_url, data={'page': 'page:catalogue',
                                  'pasFavori': '1',
                                  'listeFavoris': '',
                                  'cle': cle,
                                  'stand': '',
                                  'rnd': generate_random_number(0, 9999999999999999)})
    session.post(reload_url)
    pop = session.post(data_url, data={'cle': cle,
                                       'rnd': generate_random_number(0, 9999999999999999)})

    print(pop.text)

Prints:

<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>Via Nazionale 54<br>IT-00184 - Roma<br><img src='../../intranetJ2C/images/flags/IT.gif' style='margin-right:5px;'>ITALY<br><br>Phone: +39 06 488 0247 | Fax: +39 06 482 74 76<br><br>Website: <a href='http://www.aiad.it' target='_new'>www.aiad.it</a></td></tr></table><br><b class="divMarque">Contact:</b><br><font class="ficheAdresse"> Carlo Festucci - Secretary General<br><a href="mailto:[email protected]">[email protected]</a></font><br><br><div id='divTexte' class='ficheTexte'></div>
<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>An der Faehre 2<br>27809 - Lemwerder<br><img src='../../intranetJ2C/images/flags/DE.gif' style='margin-right:5px;'>GERMANY<br><br>Phone: +49 421 673 30 | Fax: +49 421 673 3115<br><br>Website: <a href='http://www.abeking.com' target='_new'>www.abeking.com</a></td></tr></table><br><b class="divMarque">Contact:</b><br><font class="ficheAdresse"> Thomas Haake - Sales Director Navy</font><br><br><div id='divTexte' class='ficheTexte'></div>
<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>Mohamed Bin Khalifa Street (street 15)<br>PO Box 107241<br>107241 - Abu Dhabi<br><img src='../../intranetJ2C/images/flags/AE.gif' style='margin-right:5px;'>UNITED ARAB EMIRATES<br><br>Phone: +971 2 445 5551 | Fax: +971 2 445 0644</td></tr></table><br><b class="divMarque">Contact:</b><br><font class="ficheAdresse"> Pierre Baz - Business Development<br><a href="mailto:[email protected]">[email protected]</a></font><br><br><div id='divTexte' class='ficheTexte'></div>
...
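Once the fragments come back filled in, they can be parsed with BeautifulSoup as well. The following is only a sketch: parse_popup is a hypothetical helper, the sample HTML is the second response shown above, and the selectors assume every pop-up uses the same ficheAdresse markup. The None guards are there because some entries (the third one above, for instance) have no website link.

```python
from bs4 import BeautifulSoup

# Sample response copied from the output above (second exhibitor).
html = (
    "<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>"
    "An der Faehre 2<br>27809 - Lemwerder<br>"
    "<img src='../../intranetJ2C/images/flags/DE.gif' style='margin-right:5px;'>"
    "GERMANY<br><br>Phone: +49 421 673 30 | Fax: +49 421 673 3115<br><br>"
    "Website: <a href='http://www.abeking.com' target='_new'>www.abeking.com</a>"
    "</td></tr></table><br><b class=\"divMarque\">Contact:</b><br>"
    "<font class=\"ficheAdresse\"> Thomas Haake - Sales Director Navy</font>"
    "<br><br><div id='divTexte' class='ficheTexte'></div>"
)


def parse_popup(fragment):
    """Extract a few fields from one pop-up fragment (hypothetical helper)."""
    soup = BeautifulSoup(fragment, 'html.parser')
    cell = soup.find('td', class_='ficheAdresse')      # address block
    link = cell.find('a') if cell else None            # website link, if any
    contact = soup.find('font', class_='ficheAdresse')  # contact block
    return {
        'address_lines': list(cell.stripped_strings) if cell else [],
        'website': link.get('href') if link else None,
        'contact': contact.get_text(' ', strip=True) if contact else None,
    }


info = parse_popup(html)
print(info['website'])
print(info['contact'])
```

Inside the loop, the same helper could be called as parse_popup(pop.text) for each exhibitor.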

Upvotes: 2
