DCN

Reputation: 137

Scrape a JavaScript-rendered website with Python

I don't have much experience scraping data from websites. I normally use Python "requests" and "BeautifulSoup".

The table I need to download is at https://publons.com/awards/highly-cited/2019/ and I did the usual right click and Inspect, but the page isn't in the format I'm used to working with. After a bit of reading, it seems the content is rendered with JavaScript, and I could potentially extract the data from https://publons.com/static/cache/js/app-59ff4a.js instead. Other posts recommend Selenium and PhantomJS. However, I can't modify the paths because I'm not an admin on this computer (I'm using Windows). Any ideas on how to tackle this? I'm happy to use R if Python isn't an option.

Thanks!

Upvotes: 0

Views: 122

Answers (1)

QHarr

Reputation: 84465

If you monitor the web traffic in the browser's dev tools (Network tab), you will see the API calls the page makes to update its content. The data comes back as JSON.

For example: page 1

import requests

r = requests.get('https://publons.com/awards/api/2019/hcr/?page=1&per_page=10').json()
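Before parsing, it can help to look at the top-level shape of the decoded response. The following is a minimal sketch: `describe` is a hypothetical helper, and the sample payload is made up for illustration. Only the `count` key is confirmed by the response above; the `results` key is an assumption.

```python
def describe(payload):
    """Return a {key: type name} summary of a decoded JSON object."""
    return {key: type(value).__name__ for key, value in payload.items()}


# Hypothetical payload standing in for the real response.
# 'count' appears in the real API; 'results' is an assumed key name.
sample = {"count": 25, "results": [{"name": "A"}]}

print(describe(sample))  # {'count': 'int', 'results': 'list'}
```

Calling `describe(r)` on the real response would show which keys are actually available.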

You can alter the page param in a loop to get all results.

The total number of results is given in the first response as r['count'], so it is easy to calculate how many pages to loop over at 10 results per page. Just be sure to be polite in how you make your requests, for example by pausing between them.
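As a quick check of that arithmetic, here is a small sketch. `number_of_pages` is a hypothetical helper and the counts are made-up values. Rounding up matters because a partially filled last page still needs its own request.

```python
import math


def number_of_pages(count, per_page=10):
    """Pages needed to cover `count` results at `per_page` results per page."""
    return math.ceil(count / per_page)


print(number_of_pages(95))   # 10 (nine full pages plus one with 5 results)
print(number_of_pages(100))  # 10
print(number_of_pages(101))  # 11
```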

Outline:

import math
import requests

per_page = 10
url = 'https://publons.com/awards/api/2019/hcr/'

with requests.Session() as s:
    r = s.get(url, params={'page': 1, 'per_page': per_page}).json()
    # Do something with the JSON, e.g. parse the items of interest into a list and add them to a final list; convert to a DataFrame at the end.
    number_pages = math.ceil(r['count'] / per_page)

    for page in range(2, number_pages + 1):
        # Perhaps add a delay after every X requests.
        r = s.get(url, params={'page': page, 'per_page': per_page}).json()
        # Do something with the JSON, as above.
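One possible way to fill in the outline is sketched below. It assumes only that each page's JSON carries a `count` key, as noted above. `fetch_all` and `fake_get_page` are hypothetical names: the page-fetching function is passed in so the loop can be tried without touching the network, and the one-second default delay is an arbitrary choice.

```python
import math
import time


def fetch_all(get_page, per_page=10, delay=1.0):
    """Collect the decoded JSON of every page.

    get_page(page) must return the decoded JSON dict for that page number.
    """
    first = get_page(1)
    pages = [first]
    number_pages = math.ceil(first["count"] / per_page)
    for page in range(2, number_pages + 1):
        time.sleep(delay)  # pause between requests to stay polite
        pages.append(get_page(page))
    return pages


def fake_get_page(page):
    # Stand-in for the real request: pretends there are 25 results in total.
    return {"count": 25, "page": page}


pages = fetch_all(fake_get_page, delay=0)
print(len(pages))  # 3
```

With a real session, get_page could be something like `lambda page: s.get(url, params={'page': page, 'per_page': 10}).json()`, and the collected pages can then be parsed into rows for a DataFrame.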

Upvotes: 3
