Reputation:
I need to scrape an image from this website: https://web.archive.org/web/, for example for stackoverflow and towardsdatascience.
URL
stackoverflow.com
towardsdatascience.com
I do not know how to get the information from the table/image contained within this element:
<div class="sparkline" style="width: 1225px;"><div id="wm-graph-anchor"><div id="wm-ipp-sparkline" title="Explore captures for this URL" style="height: 77px;"><canvas class="sparkline-canvas" width="1225" height="75" alt="sparklines"></canvas></div></div><div id="year-labels"><span class="sparkline-year-label">1996</span><span class="sparkline-year-label">1997</span><span class="sparkline-year-label">1998</span><span class="sparkline-year-label">1999</span><span class="sparkline-year-label">2000</span><span class="sparkline-year-label">2001</span><span class="sparkline-year-label">2002</span><span class="sparkline-year-label">2003</span><span class="sparkline-year-label">2004</span><span class="sparkline-year-label">2005</span><span class="sparkline-year-label">2006</span><span class="sparkline-year-label">2007</span><span class="sparkline-year-label">2008</span><span class="sparkline-year-label">2009</span><span class="sparkline-year-label">2010</span><span class="sparkline-year-label">2011</span><span class="sparkline-year-label">2012</span><span class="sparkline-year-label">2013</span><span class="sparkline-year-label">2014</span><span class="sparkline-year-label">2015</span><span class="sparkline-year-label">2016</span><span class="sparkline-year-label">2017</span><span class="sparkline-year-label">2018</span><span class="sparkline-year-label">2019</span><span class="sparkline-year-label selected-year">2020</span></div></div>
i.e. the image where the timeline is shown across the years. I would like to save this image/table for each website, if possible. I tried to write some code, but it is missing this part:
import json
import pandas as pd
import requests

def my_function(file):
    # Deduplicate the URLs from the input DataFrame
    urls = list(set(file.URL.tolist()))
    df_url = pd.DataFrame(columns=['URL'])
    df_url['URL'] = urls

    api_url = 'https://web.archive.org/__wb/search/metadata'
    for url in df_url['URL']:
        res = requests.get(api_url, params={'q': url})
        # part to scrape the image
    return

my_function(df)
Can you give me some input on how to get those images?
Upvotes: 0
Views: 62
Reputation: 15498
If you have each image URL in the for loop, you can download the images using the Python standard-library function urllib.request.urlretrieve:
First, add these imports at the beginning of the script:
import os
from urllib.parse import urlparse
import urllib.request
And then download them using
for url in df_url['URL']:
    # `url` is assumed here to point directly at an image file
    urllib.request.urlretrieve(url, os.path.basename(urlparse(url).path))
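For example, with a hypothetical direct image URL (the URL below is a placeholder, not a real Wayback Machine file), the saved filename is the last segment of the URL path:

img_url = "https://example.com/charts/stackoverflow-sparkline.png"
filename = os.path.basename(urlparse(img_url).path)  # -> "stackoverflow-sparkline.png"
urllib.request.urlretrieve(img_url, filename)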
If you don't want to save the files under the URL basename, you don't need the first two imports (os and urlparse).
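Note, however, that in the HTML you posted the year timeline is a <canvas> element drawn by JavaScript, so there is no <img> URL to pass to urlretrieve. One workaround is to fetch the capture counts behind the chart and save a plot per site yourself. Below is a minimal sketch, assuming the undocumented https://web.archive.org/__wb/sparkline endpoint that the web.archive.org page itself appears to call (its response format, a "years" mapping of year to monthly capture counts, is an assumption based on inspecting the page in the browser's network tab and may change), and using matplotlib to write one PNG per URL:

import requests
import matplotlib.pyplot as plt

# Assumption: this is the undocumented endpoint behind the sparkline canvas.
SPARKLINE_API = "https://web.archive.org/__wb/sparkline"

def save_capture_chart(site_url):
    # Ask the Wayback Machine for capture counts for this URL.
    res = requests.get(
        SPARKLINE_API,
        params={"output": "json", "url": site_url, "collection": "web"},
        headers={"Referer": "https://web.archive.org/"},  # the endpoint may reject requests without a Referer
        timeout=30,
    )
    res.raise_for_status()
    data = res.json()

    # Assumption: "years" maps each year to a list of monthly capture counts.
    years = sorted(data.get("years", {}))
    totals = [sum(data["years"][y]) for y in years]

    # Draw a simple bar chart as a stand-in for the sparkline image and save it.
    plt.figure(figsize=(10, 2))
    plt.bar(years, totals)
    plt.title("Wayback Machine captures per year: " + site_url)
    plt.tight_layout()
    plt.savefig(site_url.replace("/", "_") + "_captures.png")
    plt.close()

save_capture_chart("stackoverflow.com")
save_capture_chart("towardsdatascience.com")

The pandas part of your code stays the same: call save_capture_chart(url) inside your existing for loop instead of the metadata request.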
Upvotes: 0