Reputation: 1311
I would like to retrieve information from Google Arts & Culture using BeautifulSoup.
I have checked many Stack Overflow posts ([1], [2], [3], [4], [5]) and still couldn't retrieve the information.
I would like to get information such as the href for each tile (picture), which is an li element. However, find_all and select_one return an empty list or None.
Could you help me get the below href value of anchor tag of class "e0WtYb HpzMff PJLMUc" ?
href="/entity/claude-monet/m01xnj?categoryId=artist"
Below is what I have tried.
import requests
from bs4 import BeautifulSoup
url = 'https://artsandculture.google.com/category/artist?tab=time&date=1850'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.find_all('li', class_='DuHQbc')) # []
print(soup.find_all('a', class_='PJLMUc')) # []
print(soup.find_all('a', class_='e0WtYb HpzMff PJLMUc')) # []
print(soup.select_one('#tab_time > div > div:nth-child(2) > div > ul > li:nth-child(2) > a')) # None
for elem in soup.find_all('a', class_=['e0WtYb', 'HpzMff', 'PJLMUc'], href=True):
    print(elem) # others with class 'e0WtYb'
    ...
    # and then something like elem['href']
https://artsandculture.google.com/category/artist?tab=time&date=1850
Copied selector from Chrome
#tab_time > div > div:nth-child(2) > div > ul > li:nth-child(2) > a
Upvotes: 5
Views: 544
Reputation: 99
To scrape Google Arts & Culture, the BeautifulSoup web scraping library alone is enough. However, because the page is dynamic, we need to change strategy from parsing HTML elements (CSS selectors, etc.) to parsing the data with regular expressions.
We need regular expressions because the data we want is sent by the server as inline JSON, which is then (I guess) rendered by JavaScript. First, look at the page source (CTRL + U) to check whether the data appears there and, if so, where exactly.
Since the data for all three tabs (All, A-Z, Time) is returned at once, we need to select the part of the JSON that belongs to the "Time" tab, then use regular expressions to extract the fields from it, for example the author, the link to the author, and the number of paintings.
Here's an example regular expression that extracts part of the inline JSON that contains data from the "Time" tab:
# https://regex101.com/r/4XAQ49/1
# raw string, so the regex backslashes are not treated as Python string escapes
portion_of_script_tags = re.search(r"\[\"stella\.pr\",\"DatedAssets:.*\",\[\[\"stella\.common\.cobject\",(.*?)\[\]\]\]\;<\/script>", str(all_script_tags)).group(1)
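To show the general idea in isolation, here is a minimal sketch that pulls inline JSON out of a script tag and parses it. The sample page, the variable name window.INIT_DATA, and the row layout are all made up for illustration. The real Google Arts & Culture payload is nested differently, which is why the regular expression above is more involved.

```python
import json
import re

# Hypothetical page source: the data lives in a <script> tag, not in HTML elements.
SAMPLE_HTML = (
    '<html><body><ul id="tiles"></ul>'
    '<script>window.INIT_DATA = [["Claude Monet", "/entity/claude-monet/m01xnj", 275],'
    ' ["Paul Gauguin", "/entity/paul-gauguin/m0h82x", 380]];</script>'
    '</body></html>'
)

# Capture everything between "window.INIT_DATA = " and the closing ";</script>".
match = re.search(r"window\.INIT_DATA = (.*?);</script>", SAMPLE_HTML, re.DOTALL)

# The captured text is valid JSON in this sample, so json.loads can parse it.
rows = json.loads(match.group(1)) if match else []

artists = [
    {"author": name, "link": link, "artworks": count}
    for name, link, count in rows
]
print(artists)
```

When the captured text is valid JSON like this, json.loads is usually less fragile than running further regular expressions on it. When it is not, falling back to regular expressions as in the full snippet is a reasonable option.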
Also note that the request might be blocked when using requests, because the default user-agent in the requests library is python-requests. An additional step could be to rotate the user-agent, for example, to switch between PC, mobile, and tablet, as well as between browsers such as Chrome, Firefox, Safari, and Edge.
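A minimal sketch of user-agent rotation, assuming a small hand-maintained list. The user-agent strings below are only illustrative, and random_headers is a hypothetical helper name, not part of requests.

```python
import random

# Illustrative user-agent strings; in practice, keep this list up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0",
]


def random_headers():
    """Return request headers with a randomly chosen user-agent (hypothetical helper)."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
# The result can then be passed to requests, e.g.
# requests.get(url, headers=headers, timeout=30)
print(headers["User-Agent"])
```

Calling random_headers() before each request gives every request a user-agent picked at random from the list.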
The code snippet below extracts 54 authors (the same code is also available in the online IDE):
from bs4 import BeautifulSoup
import requests, json, re, lxml
# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
params = {
    "tab": "time",
    "date": "1850"
}
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}
html = requests.get("https://artsandculture.google.com/category/artist", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
author_results = []
# the data we need is inline JSON inside one of the <script> tags
all_script_tags = soup.select("script")
# https://regex101.com/r/4XAQ49/1
portion_of_script_tags = re.search(r"\[\"stella\.pr\",\"DatedAssets:.*\",\[\[\"stella\.common\.cobject\",(.*?)\[\]\]\]\;<\/script>", str(all_script_tags)).group(1)
# https://regex101.com/r/XXAbKH/1
authors = re.findall(r"\"((?!stella\.common\.cobject)\w.*?)\",\"\d+", str(portion_of_script_tags))
# https://regex101.com/r/K4K3iB/1
author_links = [f"https://artsandculture.google.com{link}" for link in re.findall(r"\"(/entity.*?)\"", str(portion_of_script_tags))]
# https://regex101.com/r/x6wwVJ/1
number_of_artworks = re.findall(r"\"(\d+).*?items\"", str(portion_of_script_tags))
# the three lists are in the same order, so zip pairs up the fields of each author
for author, author_link, num_artworks in zip(authors, author_links, number_of_artworks):
    author_results.append({
        "author": author,
        "author_link": author_link,
        "number_of_artworks": num_artworks
    })
print(json.dumps(author_results, indent=2, ensure_ascii=False))
Example output
[
  {
    "author": "Vincent van Gogh",
    "author_link": "https://artsandculture.google.com/entity/vincent-van-gogh/m07_m2?categoryId\\u003dartist",
    "number_of_artworks": "338"
  },
  {
    "author": "Claude Monet",
    "author_link": "https://artsandculture.google.com/entity/claude-monet/m01xnj?categoryId\\u003dartist",
    "number_of_artworks": "275"
  },
  {
    "author": "Paul Cézanne",
    "author_link": "https://artsandculture.google.com/entity/paul-cézanne/m063mx?categoryId\\u003dartist",
    "number_of_artworks": "301"
  },
  {
    "author": "Paul Gauguin",
    "author_link": "https://artsandculture.google.com/entity/paul-gauguin/m0h82x?categoryId\\u003dartist",
    "number_of_artworks": "380"
  },
  # ...
]
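Note that the links in the output still contain the JSON escape \u003d in place of "=", because they were cut out of the inline JSON with a regular expression and never decoded. One possible way to clean them up is to parse each link as a JSON string. This is a minimal sketch: decode_json_escapes is a hypothetical helper, and it assumes the text contains no unescaped double quotes or control characters.

```python
import json


def decode_json_escapes(text):
    # Wrap the text in double quotes so json.loads treats it as a JSON string
    # and decodes escape sequences such as the one standing for "=".
    # Assumption: text has no unescaped double quotes or control characters.
    return json.loads(f'"{text}"')


# Raw string, so the backslash stays a literal character, as in the scraped link.
raw_link = r"https://artsandculture.google.com/entity/claude-monet/m01xnj?categoryId\u003dartist"

print(decode_json_escapes(raw_link))
# https://artsandculture.google.com/entity/claude-monet/m01xnj?categoryId=artist
```

Non-ASCII characters such as the "é" in paul-cézanne pass through unchanged with this approach, which is not the case when decoding with the unicode_escape codec.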
Upvotes: 3
Reputation: 899
Unfortunately, the problem is not that you're using BeautifulSoup wrong. The webpage that you're requesting appears to be missing its content! I saved html.text to a file for inspection, and the artist tiles are not in it.
Why does this happen? Because the webpage actually loads its content using JavaScript. When you open the site in your browser, the browser executes the JavaScript, which adds all of the artist squares to the webpage. (You may even notice the brief moment during which the squares aren't there when you first load the site.) On the other hand, requests does NOT execute JavaScript; it just downloads the contents of the webpage and saves them to a string.
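You can check this kind of thing without a browser by counting the tags in the downloaded text. Here is a minimal sketch using only the standard library. SAMPLE_RESPONSE is a made-up stand-in for html.text, and TagCounter is a hypothetical helper class.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for html.text: the server sends an empty list
# plus a script that would fill it in once a browser runs it.
SAMPLE_RESPONSE = """
<html><body>
<ul id="tiles"></ul>
<script>
  var data = [["Claude Monet", "/entity/claude-monet/m01xnj"]];
  // a browser would build the list items from data here
</script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Count how many times each start tag occurs in the markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(SAMPLE_RESPONSE)

# No li tags exist in the static markup, which is why find_all('li') comes back empty.
print(counter.counts.get("li", 0))
# The artist name is still present, but only inside the script text.
print("Claude Monet" in SAMPLE_RESPONSE)
```

If the tag count is zero but the text you are after does appear somewhere in the response, the data is being shipped inside a script and rendered client-side.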
What can you do about it? Unfortunately, this means that scraping the website will be really tough. In such cases, I would suggest looking for an alternative source of information or using an API provided by the website.
Upvotes: 2