Reputation: 31963

Download first 1000 images from google search

I ran a search on Google Images:

http://www.google.com/search?hl=en&q=panda&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&biw=1287&bih=672&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=qW4FUJigJ4jWtAbToInABg

and the result is thousands of photos. I am looking for a shell script that will download the first n images, for example 1000 or 500.

How can I do this?

I guess I need some advanced regular expressions or something like that. I have tried many things, but to no avail. Can someone help me, please?

Upvotes: 15

Answers (10)

Sam Watkins

Reputation: 8359

update 4: PhantomJS is now obsolete, I made a new script google-images.py in Python using Selenium and Chrome headless. See here for more details: https://stackoverflow.com/a/61982397/218294

update 3: I fixed the script to work with phantomjs 2.x.

update 2: I modified the script to use phantomjs. It's harder to install, but at least it works again. http://sam.nipl.net/b/google-images http://sam.nipl.net/b/google-images.js

update 1: Unfortunately this no longer works. It seems Javascript and other magic is now required to find where the images are located. Here is a version of the script for yahoo image search: http://sam.nipl.net/code/nipl-tools/bin/yimg

original answer: I hacked something together for this. I normally write smaller tools and use them together, but you asked for one shell script, not three dozen. This is deliberately dense code.

http://sam.nipl.net/code/nipl-tools/bin/google-images

It seems to work very well so far. Please let me know if you can improve it, or suggest any better coding techniques (given that it's a shell script).

#!/bin/bash
[ $# = 0 ] && { prog=`basename "$0"`;
echo >&2 "usage: $prog query count parallel safe opts timeout tries agent1 agent2
e.g. : $prog ostrich
       $prog nipl 100 20 on isz:l,itp:clipart 5 10"; exit 2; }
query=$1 count=${2:-20} parallel=${3:-10} safe=$4 opts=$5 timeout=${6:-10} tries=${7:-2}
agent1=${8:-Mozilla/5.0} agent2=${9:-Googlebot-Image/1.0}
query_esc=`perl -e 'use URI::Escape; print uri_escape($ARGV[0]);' "$query"`
dir=`echo "$query_esc" | sed 's/%20/-/g'`; mkdir "$dir" || exit 2; cd "$dir"
url="http://www.google.com/search?tbm=isch&safe=$safe&tbs=$opts&q=$query_esc" procs=0
echo >.URL "$url" ; for A; do echo >>.args "$A"; done
htmlsplit() { tr '\n\r \t' ' ' | sed 's/</\n</g; s/>/>\n/g; s/\n *\n/\n/g; s/^ *\n//; s/ $//;'; }
for start in `seq 0 20 $[$count-1]`; do
wget -U"$agent1" -T"$timeout" --tries="$tries" -O- "$url&start=$start" | htmlsplit
done | perl -ne 'use HTML::Entities; /^<a .*?href="(.*?)"/ and print decode_entities($1), "\n";' | grep '/imgres?' |
perl -ne 'use URI::Escape; ($img, $ref) = map { uri_unescape($_) } /imgurl=(.*?)&imgrefurl=(.*?)&/;
$ext = $img; for ($ext) { s,.*[/.],,; s/[^a-z0-9].*//i; $_ ||= "img"; }
$save = sprintf("%04d.$ext", ++$i); print join("\t", $save, $img, $ref), "\n";' |
tee -a .images.tsv |
while IFS=$'\t' read -r save img ref; do
wget -U"$agent2" -T"$timeout" --tries="$tries" --referer="$ref" -O "$save" "$img" || rm "$save" &
procs=$[$procs + 1]; [ $procs = $parallel ] && { wait; procs=0; }
done ; wait

Features:

under 1500 bytes
explains usage, if run with no args
downloads full images in parallel
safe search option
image size, type, etc. opts string
timeout / retries options
impersonates googlebot to fetch all images
numbers image files
saves metadata
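As an illustration of the "numbers image files" feature, here is a minimal Python sketch of the naming rule that the Perl one-liner in the script appears to apply: take whatever follows the last "/" or "." in the image URL, cut it at the first non-alphanumeric character, fall back to "img" if nothing is left, and prefix a four-digit counter. The function name numbered_name and the example URLs are hypothetical and not part of the original script.

```python
import re


def numbered_name(index, image_url):
    """Build a file name like 0001.jpg from a counter and an image URL."""
    # Drop everything up to and including the last "/" or "."
    ext = re.sub(r".*[/.]", "", image_url, count=1)
    # Cut the remainder at the first character that is not a letter or digit
    ext = re.sub(r"[^a-z0-9].*", "", ext, count=1, flags=re.I)
    # Fall back to "img" when no usable extension is left
    return "%04d.%s" % (index, ext or "img")


print(numbered_name(1, "http://example.com/pics/panda.JPG?x=1"))  # 0001.JPG
print(numbered_name(12, "http://example.com/image/"))             # 0012.img
```

A URL without a file extension ends up with its last path segment as the "extension", which matches the quick-and-dirty spirit of the shell version.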

I'll post a modular version some time, to show that it can be done quite nicely with a set of shell scripts and simple tools.

Upvotes: 18

Lance Samaria

Reputation: 19612

I used this to download 1000 images and it 100% worked for me: atif93/google_image_downloader

After you download it, open a terminal and install Selenium:

$ pip install selenium --user

Then check your Python version:

$ python --version

If you are running Python 2.7, then to download 1000 images of pizza run:

$ python image_download_python2.py 'pizza' '1000'

If you are running Python 3, then to download 1000 images of pizza run:

$ python image_download_python3.py 'pizza' '1000'

The breakdown is:

python image_download_python2.py <query> <number of images>
python image_download_python3.py <query> <number of images>

query is the name of the image you're looking for, and the number of images is how many to download. In my example above, the query is pizza and I want 1000 images of it.

Upvotes: 0

Hardik Vasa

Reputation: 57

How about using this library? google-images-download

Anyone still looking for a decent way to download hundreds of images can use this command-line tool.

Upvotes: 0

rishabhr0y

Reputation: 868

Python script to download full-resolution images from Google Image Search. Currently it downloads 100 images per query.

import json
import os
import urllib.request

from bs4 import BeautifulSoup


def get_soup(url, header):
    # fetch the page with a browser-like User-Agent and parse the HTML
    request = urllib.request.Request(url, headers=header)
    return BeautifulSoup(urllib.request.urlopen(request), "html.parser")


query = input("query image: ")  # you can change the query for the image here
image_type = "ActiOn"  # prefix used for the saved file names
query = "+".join(query.split())
url = "https://www.google.co.in/search?q=" + query + "&source=lnms&tbm=isch"
print(url)
# add the directory for your image here
DIR = "C:\\Users\\Rishabh\\Pictures\\" + query.split("+")[0] + "\\"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"
}
soup = get_soup(url, header)


actual_images = []  # (link to the large original image, type of image) pairs
for a in soup.find_all("div", {"class": "rg_meta"}):
    meta = json.loads(a.text)
    actual_images.append((meta["ou"], meta["ity"]))

print("there are total", len(actual_images), "images")


# download the images
for img, img_type in actual_images:
    try:
        # header is already a dict of request headers, so pass it as-is
        req = urllib.request.Request(img, headers=header)
        raw_img = urllib.request.urlopen(req).read()
        if not os.path.exists(DIR):
            os.makedirs(DIR)
        # number the file after the images already saved with this prefix
        cntr = len([name for name in os.listdir(DIR) if image_type in name]) + 1
        print(cntr)
        ext = img_type if img_type else "jpg"  # fall back to .jpg when the type is missing
        with open(DIR + image_type + "_" + str(cntr) + "." + ext, "wb") as f:
            f.write(raw_img)
    except Exception as e:
        print("could not load : " + img)
        print(e)

I am reposting my solution here; I originally posted it on the following question: https://stackoverflow.com/a/28487500/2875380

Upvotes: 0

Vijay

Reputation: 911

I found an easier way to do this with this tool. I can confirm that it works well as of this post.

Feature Requests to the developer:

Get a preview of the image(s) to verify that it's correct.
Allow input of multiple terms sequentially (i.e. batch processing).

Upvotes: 0

johndpope

Reputation: 5257

There are other libraries on GitHub; this one looks quite good: https://github.com/Achillefs/google-cse

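require 'open-uri'  # provides URI.open for fetching the image over HTTP

g = GoogleCSE.image_search('Ian Kilminster')
img = g.fetch.results.first.link
file = img.split('/').last
# write in binary mode so the image data is saved unchanged
File.open(file,'wb') {|f| f.write(URI.open(img).read)}
# open the saved file in Preview (macOS only)
`open -a Preview #{file}`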

Upvotes: -1

LeMoussel

Reputation: 5767

Building on Pavan Manjunath's answer, if you also want the height and width of the image:

(?<=imgurl=)(?<imgurl>.*?)(?=&).*?(?<=h=)(?<height>.*?)(?=&).*?(?<=w=)(?<width>.*?)(?=&)

This gives you three named groups, imgurl, height and width, each holding the corresponding value.
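A minimal Python sketch of the same pattern, for illustration only. Python spells named groups as (?P<name>...) rather than (?<name>...), so the pattern is adapted accordingly. The function name parse_imgres and the sample link are assumptions, not real Google output.

```python
import re

# Same pattern as above, written with Python's (?P<name>...) group syntax
IMGRES_RE = re.compile(
    r"(?<=imgurl=)(?P<imgurl>.*?)(?=&)"
    r".*?(?<=h=)(?P<height>.*?)(?=&)"
    r".*?(?<=w=)(?P<width>.*?)(?=&)"
)


def parse_imgres(link):
    """Return the imgurl, height and width groups as a dict, or None if no match."""
    match = IMGRES_RE.search(link)
    return match.groupdict() if match else None


# Hypothetical link in the shape the pattern expects (h= before w=, each followed by &)
link = "/imgres?imgurl=http://example.com/panda.jpg&imgrefurl=http://example.com/&h=600&w=800&sz=50"
print(parse_imgres(link))
# {'imgurl': 'http://example.com/panda.jpg', 'height': '600', 'width': '800'}
```

The values come back as strings, and the pattern only matches when h= appears before w= and each value is followed by an &, so a link with a different parameter order would need a different approach.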

Upvotes: 0

Ray Hayes

Reputation: 15015

Rather than attempt to parse the HTML (which is very hard and likely to break), consider the APIs highlighted by @Pavan in his answer.

Additionally, consider using a tool that already tries to do something similar. Wget has a spider-like feature for following links (specifically for specified file types). See this answer to the Stack Overflow question 'how do i use wget to download all images into a single folder'.

Regex is wonderfully useful, but I don't think it is in this context - remember the Regex mantra:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

-- Jamie Zawinski

Upvotes: 0

ghoti

Reputation: 46876

Rather than doing this in shell with regexps, you may have an easier time if you use something that can actually parse the HTML itself, like PHP's DOMDocument class.

If you're stuck using only shell and need to slurp image URLs, you may not be totally out of luck. Regular Expressions are inappropriate for parsing HTML, because HTML is not a regular language. But you may still be able to get by if your input data is highly predictable. (There is no guarantee of this, because Google updates their products and services regularly and often without prior announcement.)

That said, in the output of the URL you provided in your question, each image URL seems to be embedded in an anchor that links to /imgres?…. If we can parse those links, we can probably gather what we need from them. Within those links, image URLs appear to be preceded with &imgurl=. So let's scrape this.

#!/usr/local/bin/bash

# Possibly violate Google's terms of service by lying about our user agent
agent="Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20100101 Firefox/12.0"

# Search URL
url="http://www.google.com/search?hl=en&q=panda&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&biw=1287&bih=672&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=qW4FUJigJ4jWtAbToInABg"

curl -A "$agent" -s -D- "$url" \
 | awk '{gsub(/<a href=/,"\n")} 1' \
 | awk '
   /imgres/ {
     sub(/" class=rg_l >.*/, "");       # clean things up
     n_fields = split($0, fields, "&amp;");  # gather the "GET" fields
     for (n=1; n<=n_fields; n++) {
       split(fields[n], a, "=");        # split name=value pair
       getvars[a[1]]=a[2];              # store in array
     }
     print getvars["imgurl"];           # print the result
   }
 '

I'm using two awk commands because ... well, I'm lazy, and that was the quickest way to generate lines in which I could easily find the "imgres" string. One could spend more time on this cleaning it up and making it more elegant, but the law of diminishing returns dictates that this is as far as I go with this one. :-)

This script returns a list of URLs that you could download easily using other shell tools. For example, if the script is called getimages, then:

./getimages | xargs -n 1 wget

Note that Google appears to be handing me only 83 results (not 1000) when I run this with the search URL you specified in your question. It's possible that this is just the first page that Google would generally hand out to a browser before "expanding" the page (using JavaScript) when I get near the bottom. The proper way to handle this would be to use Google's search API, per Pavan's answer, and to PAY Google for their data if you're making more than 100 searches per day.
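For comparison, here is a rough Python sketch of the "use something that can actually parse the HTML" approach, using only the standard library. It assumes the page structure described above (anchors linking to /imgres?… with an imgurl parameter), which Google may well have changed since. The class and function names and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class ImgresLinkParser(HTMLParser):
    """Collect the imgurl parameter of every <a href="/imgres?..."> link."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # HTMLParser has already turned &amp; back into & in attribute values
        href = dict(attrs).get("href") or ""
        parts = urlsplit(href)
        if parts.path != "/imgres":
            return
        # parse_qs splits the query string and undoes percent-encoding
        params = parse_qs(parts.query)
        if "imgurl" in params:
            self.image_urls.append(params["imgurl"][0])


def extract_image_urls(html_text):
    parser = ImgresLinkParser()
    parser.feed(html_text)
    return parser.image_urls


# Hypothetical sample of the markup described above, not real Google output
sample = (
    '<a href="/imgres?imgurl=http://example.com/a%20b.jpg&amp;imgrefurl=http://example.com/'
    '&amp;h=600&amp;w=800" class=rg_l >x</a>'
    '<a href="/search?q=panda">more</a>'
)
print(extract_image_urls(sample))  # ['http://example.com/a b.jpg']
```

Because the parser looks up the parameter by name, it does not depend on the order of the fields in the link or on the class attribute that the awk version strips.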

Upvotes: 2

Pavan Manjunath

Reputation: 28545

I don't think you can achieve the entire task using regexes alone. There are 3 parts to this problem:

1. Extract the links of all the images. This can't be done with regexes; you need to use a web-based language for this. Google has APIs to do this programmatically. Check out here and here.

2. Assuming you succeeded in the first step with some web-based language, you can use the following regex, which uses a lookbehind and a lookahead to extract the exact image URL:

(?<=imgurl=).*?(?=&)

The above regex says - Grab everything starting after imgurl= and till you encounter the & symbol. See here for an example, where I took the URL of the first image of your search result and extracted the image URL.

How did I arrive at the above regex? By examining the links of the images found in the image search.

3. Now that you've got the image URLs, use some web-based language or tool to download your images.
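Steps 2 and 3 could be sketched in Python roughly as follows. This is only an illustration: the function names and the sample link are hypothetical, and the download helper is defined but not called because it needs network access.

```python
import re
import urllib.request
from urllib.parse import unquote

# The regex from step 2: everything after "imgurl=" up to the next "&"
IMGURL_RE = re.compile(r"(?<=imgurl=).*?(?=&)")


def extract_image_url(imgres_link):
    """Step 2: return the decoded image URL from an /imgres link, or None."""
    match = IMGURL_RE.search(imgres_link)
    return unquote(match.group(0)) if match else None


def download(image_url, filename):
    """Step 3: save one image to disk (requires network access)."""
    urllib.request.urlretrieve(image_url, filename)


# Hypothetical link in the shape described above
link = "/imgres?imgurl=http://example.com/giant%20panda.jpg&imgrefurl=http://example.com/&h=600&w=800"
print(extract_image_url(link))  # http://example.com/giant panda.jpg
```

Note that the regex needs an & after the image URL, so it finds nothing if imgurl happens to be the last parameter in the link.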

Upvotes: 6
