Reputation: 55
So I wanted to scrape visualizations from visual.ly, however right now I do not understand how the "show more" button works. As of now, my code will get the image link, the text next to the image, and the link of the page. I was wondering how the "show more" button functions, because I was going to try to loop through using the number of pages. As of now I do not know how i would loop through each one individually. Any ideas on how I could loop through and go on to get more images than they originally show you????
from BeautifulSoup import BeautifulSoup
import urllib2
import HTMLParser
import urllib, re
counter = 1
columnno = 1
parser = HTMLParser.HTMLParser()
soup = BeautifulSoup(urllib2.urlopen('http://visual.ly/?view=explore& type=static#v2_filter').read())
image = soup.findAll("div", attrs = {'class': 'view-mode-wrapper'})
if columnno < 4:
column = image[0].findAll("div", attrs = {'class': 'v2_grid_column'})
columnno += 1
else:
column = image[0].findAll("div", attrs = {'class': 'v2_grid_column last'})
visualizations = column[0].findAll("div", attrs = {'class': '0 v2_grid_item viewmode-item'})
getImage = visualizations[0].find("a")
print counter
print getImage['href']
soup1 = BeautifulSoup(urllib2.urlopen(getImage['href']).read())
theImage = soup1.findAll("div", attrs = {'class': 'ig-graphic-wrapper'})
text = soup1.findAll("div", attrs = {'class': 'ig-content-right'})
getText = text[0].findAll("div", attrs = {'class': 'ig-description right-section first'})
imageLink = theImage[0].find("a")
print imageLink['href']
print getText
for row in image:
theImage = image[0].find("a")
actually_download = False
if actually_download:
filename = link.split('/')[-1]
urllib.urlretrieve(link, filename)
counter += 1
Upvotes: 2
Views: 206
Reputation: 8837
You cannot use a urllib-parser combo here because it uses javascript to load more content. In order to do this you will need a full force browser emulator (with javascript support). I have never used Selenium before, but I have heard that it does this, as well as has a python binding
However, I have found that it uses a very predictable form
http://visual.ly/?page=<page_number>
for its GET requests. Perhaps an easier way would be to go under
<div class="view-mode-wrapper">...</div>
to parse the data (using the above url format). After all, ajax requests must go to a location.
Then you could do
for i in xrange(<whatever>):
url = r'http://visual.ly/?page={pagenum}'.format(pagenum=i)
#do whatever you want from here
Upvotes: 1