Reputation: 1175
I'm trying to obtain images from Google Image search for a specific query, but the page I download contains no pictures and redirects me to Google's original page. Here's my code:
AGENT_ID = "Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"
GOOGLE_URL = "https://www.google.com/images?source=hp&q={0}"
_myGooglePage = ""
def scrape(self, theQuery):
    self._myGooglePage = subprocess.check_output(["curl", "-L", "-A", self.AGENT_ID, self.GOOGLE_URL.format(urllib.quote(theQuery))], stderr=subprocess.STDOUT)
    print self.GOOGLE_URL.format(urllib.quote(theQuery))
    print self._myGooglePage
    f = open('./../../googleimages.html', 'w')
    f.write(self._myGooglePage)
What am I doing wrong?
Thanks
Upvotes: 6
Views: 7752
Reputation: 6539
One of the best ways is to use icrawler. See the answer linked below; it works for me.
https://stackoverflow.com/a/51204611/4198099
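For readers who do not want to follow the link, here is a minimal sketch of what using icrawler could look like. It assumes the third-party icrawler package is installed (pip install icrawler) and uses its built-in GoogleImageCrawler; the function name download_google_images and the default directory name are made up for this illustration, and whether the crawler still returns results depends on Google's current page layout.

```python
def download_google_images(keyword, out_dir="images", max_num=10):
    """Download up to max_num Google Images results for keyword into out_dir."""
    # Imported lazily so this module still loads when icrawler is not installed.
    from icrawler.builtin import GoogleImageCrawler

    # storage["root_dir"] is the directory the downloaded files are written to.
    crawler = GoogleImageCrawler(storage={"root_dir": out_dir})
    crawler.crawl(keyword=keyword, max_num=max_num)
```

Calling download_google_images("hello world", out_dir="googleimages", max_num=5) would then attempt to fetch five images into ./googleimages.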
Upvotes: 0
Reputation: 624
I am just going to answer this, even though it is old. There is a much simpler way to go about doing this.
import json
import urllib.parse
import urllib.request

def google_image(x):
    # URL-encode the query (spaces become %20, '&' becomes %26, and so on)
    search = urllib.parse.quote(x)
    url = 'http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=%s&safe=off' % search
    search_results = urllib.request.urlopen(url)
    js = json.loads(search_results.read().decode())
    results = js['responseData']['results']
    # Return the URL of every result rather than only the last one
    return [i['unescapedUrl'] for i in results]
That is it.
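The query-encoding step in the function above is easy to get wrong by hand, since joining words with %20 leaves characters such as & and # untouched. A small sketch of letting the standard library do the encoding follows; encode_query is a name chosen for this illustration, not something from the answer.

```python
from urllib.parse import quote


def encode_query(query):
    """Percent-encode a search query for use inside a URL query string."""
    # safe="" also encodes "/", which quote() leaves alone by default.
    return quote(query, safe="")


print(encode_query("hello world"))   # hello%20world
print(encode_query("tom & jerry"))   # tom%20%26%20jerry
```

Without this, an ampersand in the search term would be read as the start of a new URL parameter and the rest of the query would be lost.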
Upvotes: 0
Reputation: 3697
This is the code in Python that I use to search and download images from Google, hope it helps:
import os
import sys
import time
from urllib import FancyURLopener
import urllib2
import simplejson

# Define search term
searchTerm = "hello world"

# Replace spaces ' ' in search term for '%20' in order to comply with request
searchTerm = searchTerm.replace(' ','%20')

# Start FancyURLopener with defined version
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
myopener = MyOpener()

# Set count to 0
count= 0

for i in range(0,10):
    # Notice that the start changes for each iteration in order to request a new set of images for each loop
    url = ('https://ajax.googleapis.com/ajax/services/search/images?' + 'v=1.0&q='+searchTerm+'&start='+str(i*4)+'&userip=MyIP')
    print url
    request = urllib2.Request(url, None, {'Referer': 'testing'})
    response = urllib2.urlopen(request)

    # Get results using JSON
    results = simplejson.load(response)
    data = results['responseData']
    dataInfo = data['results']

    # Iterate for each result and get unescaped url
    for myUrl in dataInfo:
        count = count + 1
        print myUrl['unescapedUrl']
        myopener.retrieve(myUrl['unescapedUrl'],str(count)+'.jpg')

    # Sleep for one second to prevent IP blocking from Google
    time.sleep(1)
You can also find very useful information here.
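The paging logic in the loop above (start advancing by 4 on each request) can be isolated from the network calls. The following Python 3 sketch only builds the request URLs; page_urls and PAGE_SIZE are names invented here, and the userip parameter from the original code is left out for brevity.

```python
from urllib.parse import urlencode

API_URL = "https://ajax.googleapis.com/ajax/services/search/images"
PAGE_SIZE = 4  # the loop in the answer advances `start` by 4 per request


def page_urls(search_term, pages=10):
    """Build one request URL per result page for the given search term."""
    urls = []
    for page in range(pages):
        params = {"v": "1.0", "q": search_term, "start": page * PAGE_SIZE}
        # urlencode escapes the values; spaces are encoded as '+'.
        urls.append(API_URL + "?" + urlencode(params))
    return urls


for url in page_urls("hello world", pages=3):
    print(url)
```

Separating URL construction from downloading makes it possible to check the offsets and the encoding without sending any requests.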
Upvotes: 6
Reputation: 724
I'll give you a hint ... start here:
https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=JULIE%20NEWMAR
Where JULIE and NEWMAR are your search terms.
That will return the JSON data you need. Parse it with json.load or simplejson.load to get back a dict, then dive into it to find first the responseData, then the results list, which contains the individual items whose URLs you will want to download.
That said, I don't suggest doing automated scraping of Google in any way, since their (deprecated) API for this specifically says not to.
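The parsing steps described in this answer (responseData, then results, then each item's URL) might be sketched as follows. The SAMPLE payload is hand-written to mimic the structure described here and is not real API output; extract_image_urls is a hypothetical helper name, and the unescapedUrl key is taken from the other answers in this thread.

```python
import json


def extract_image_urls(payload):
    """Return the unescapedUrl of every item in payload['responseData']['results']."""
    # Either level may be missing or null in an error response, so fall back to empty.
    response_data = payload.get("responseData") or {}
    results = response_data.get("results") or []
    return [item["unescapedUrl"] for item in results if "unescapedUrl" in item]


# Hand-written sample that only imitates the response structure.
SAMPLE = json.loads(
    '{"responseData": {"results": ['
    '{"unescapedUrl": "http://example.com/a.jpg"},'
    '{"unescapedUrl": "http://example.com/b.jpg"}]}}'
)

print(extract_image_urls(SAMPLE))
```

Guarding against a missing responseData avoids a TypeError when the service answers with an error object instead of results.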
Upvotes: 3