Sanya Pushkar

Reputation: 190

BeautifulSoup not returning results of a search on a website

I am trying to get the links to the individual search results on a website (National Gallery of Art), but requesting the search URL does not return the search results. Here is what I tried:

import requests
from bs4 import BeautifulSoup

url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

I expected the links to the individual results to show up in soup.findAll('a'), but they do not appear. Instead, the last link in the output points to an empty search result page: https://www.nga.gov/content/ngaweb/collection-search-result.html

How can I get a list of links in which the first is the first search result (https://www.nga.gov/collection/art-object-page.52389.html), the second is the second search result (https://www.nga.gov/collection/art-object-page.52085.html), and so on?

Upvotes: 2

Views: 134

Answers (2)

Md. Fazlul Hoque

Reputation: 16187

The data is actually loaded from an API call that returns JSON, so you can request that endpoint directly. The code below prints the desired list of links.

Code:

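import requests

# JSON endpoint that backs the search results page
api_url = 'https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/parList/collectionsearchresu.pageSize__30.pageNumber__1.json?artist=C%C3%A9zanne%2C%20Paul&_=1634762134895'
r = requests.get(api_url)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page

# Each result carries a site-relative URL; prefix the host to make it absolute
for item in r.json()['results']:
    abs_url = f"https://www.nga.gov{item['url']}"
    print(abs_url)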

Output:

https://www.nga.gov/content/ngaweb/collection/art-object-page.52389.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52085.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46577.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46580.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46578.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.136014.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46576.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53120.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.54129.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.52165.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46575.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53122.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.93044.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.66405.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53119.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53121.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46579.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.66406.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45866.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53123.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45867.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45986.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45877.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.136025.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74193.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74192.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66486.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76288.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76223.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76268.html
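The endpoint above encodes the page size and page number in its path, and the printed URLs carry a /content/ngaweb prefix that the URLs in the question do not have. Here is a minimal sketch of how one might build the URL for other pages and strip that prefix. The helper names (page_url, result_links, collect_links) are hypothetical, and two things are assumptions I have not verified against the site: that pages beyond 1 follow the same pattern, and that a page with no results means you have reached the end.

```python
from urllib.parse import quote

BASE = "https://www.nga.gov"
# Path copied from the endpoint used in this answer
ENDPOINT = ("/collection-search-result/jcr:content/parmain/facetcomponent/"
            "parList/collectionsearchresu")
PREFIX = "/content/ngaweb"


def page_url(artist, page_number, page_size=30):
    """Build the JSON URL for one page of results (hypothetical helper)."""
    return (f"{BASE}{ENDPOINT}.pageSize__{page_size}"
            f".pageNumber__{page_number}.json?artist={quote(artist)}")


def result_links(payload):
    """Turn one decoded JSON response into absolute links, dropping the prefix."""
    links = []
    for item in payload.get("results", []):
        path = item["url"]
        if path.startswith(PREFIX):
            path = path[len(PREFIX):]
        links.append(BASE + path)
    return links


def collect_links(fetch_json, artist, max_pages=5):
    """Walk pages 1..max_pages; fetch_json is any callable mapping URL -> dict.

    Assumption: an empty 'results' list marks the end of the results.
    """
    links = []
    for page in range(1, max_pages + 1):
        page_links = result_links(fetch_json(page_url(artist, page)))
        if not page_links:
            break
        links.extend(page_links)
    return links
```

With requests this could be called as collect_links(lambda u: requests.get(u).json(), 'Cézanne, Paul'). Passing the fetcher in as an argument keeps the URL and parsing logic testable without hitting the site.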

Upvotes: 1

james-see

Reputation: 13206

This seems to work for me:


from bs4 import BeautifulSoup
import requests
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.find_all('a', href=True):  # skip anchors without an href to avoid a KeyError
    print(a['href'])

It prints the href of every <a> tag in the static HTML.

The links from the search results specifically are loaded via AJAX, so they are not part of that HTML. To get them this way, you would need something that renders the JavaScript, such as headless Chrome. One way to implement this, which fits your use case very closely, is described here: http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/
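For illustration, here is a rough sketch of the headless-browser route using Selenium, which is my own choice of tool and not necessarily what the linked article uses. It assumes Selenium 4 and a local Chrome are installed, and that result links can be recognised by 'art-object-page' in their href (inferred from the URLs in the question, not checked against the page markup). The function names are hypothetical.

```python
def is_result_link(href):
    """True if href looks like an individual artwork page (assumed pattern)."""
    return bool(href) and "art-object-page" in href


def rendered_result_links(url, timeout=15):
    """Load url in headless Chrome, wait for result links, return their hrefs."""
    # Imported here so is_result_link can be used without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    # Older Chrome versions may need plain "--headless" instead
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        selector = "a[href*='art-object-page']"  # assumed selector
        # Block until the AJAX-loaded links are present (or time out)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        hrefs = [a.get_attribute("href")
                 for a in driver.find_elements(By.CSS_SELECTOR, selector)]
    finally:
        driver.quit()  # always release the browser process
    # Drop duplicates while keeping page order
    return list(dict.fromkeys(h for h in hrefs if is_result_link(h)))
```

Calling rendered_result_links(url) with the search URL from the question would then return the links found after rendering. This is considerably slower than calling a JSON endpoint, so it is mainly worth it when no such endpoint is available.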

If you want to ask how to render JavaScript from Python and then parse the result, you would need to close this question and open a new one, as this one is not scoped for that.

Upvotes: 0
