midkin

Reputation: 1543

Retrieving an image over HTTP in Python

I am reading a free ebook called "Python for Informatics".

I have the following code:

import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')

count = 0
picture = ""

while True:
    data = mysock.recv(5120)
    if (len(data) < 1):
        break
    # time.sleep(0.25)
    count = count + len(data)
    print len(data), count  
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find("\r\n\r\n")
print 'Header length',pos
print picture[:pos]

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg","wb")
fhand.write(picture)
fhand.close()

I have no knowledge of HTTP and am having a hard time understanding the above code!

I think I understand what mysock.connect() and mysock.send() do; however, I need an explanation of the first line: 1) mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM). What does it do?

Now, about the line: 2) data = mysock.recv(5120). It creates a variable called data that holds up to 5120 bytes each time the while loop runs. But what type of data is this, and what happens when I run picture = picture + data? On the first pass, is it just picture = "" + data?

And finally: 3)

pos = picture.find("\r\n\r\n")
print 'Header length',pos
print picture[:pos]

Does pos = picture.find("\r\n\r\n") search inside the picture variable for two newlines ("\n\n") because we sent them in the line mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')? Also, is there any way to save the JPEG file directly to the hard drive without retrieving the HTTP header and separating it from the JPEG data?

Sorry for my English... Feel free to ask about anything you don't understand! Thanks

Upvotes: 0

Views: 4083

Answers (2)

holdenweb

Reputation: 37103

  1. The line mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) calls the socket class from the socket library to create a new network endpoint. socket.AF_INET tells the call to create an IP-based socket, and socket.SOCK_STREAM requests a stream-oriented (TCP) socket, which automatically sends any necessary acknowledgements and retries as appropriate.

  2. The statement data = mysock.recv(5120) reads chunks of up to 5120 bytes. When there is no more data the recv() call returns the empty string. The test seems rather perverse, and it would IMHO be better to use if len(data) == 0 or even if not len(data), but this is a detail of style rather than substance. The statement picture = picture + data therefore accumulates the response data in chunks of up to 5120 bytes (though the naming is poor, because the accumulated data actually includes the HTTP headers as well as the picture data).

  3. The statement pos = picture.find("\r\n\r\n") searches the returned string for the blank line that marks the end of the HTTP headers. Since find() returns the index where that four-character separator begins rather than where it ends, 4 must be added to the offset to give the starting position of the picture data.
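The three points above can be tried without a network connection. The following is a minimal Python 3 sketch, not the book's code: it assumes Python 3 (where recv() returns bytes rather than str), uses socket.socketpair() as a stand-in for the web server, and the helper names read_all and split_response and the fake response are invented for illustration.

```python
import socket


def read_all(sock, chunk_size=5120):
    """Call recv() until the peer closes the connection; return all bytes."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)   # returns at most chunk_size bytes
        if not data:                   # b"" means the other side closed
            break
        chunks.append(data)
    return b"".join(chunks)


def split_response(raw):
    """Split a raw HTTP response into (header_bytes, body_bytes)."""
    header, separator, body = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line found; headers look incomplete")
    return header, body


# Local demonstration: a connected socket pair plays the role of the server.
server_side, client_side = socket.socketpair()
fake_response = (b"HTTP/1.0 200 OK\r\n"
                 b"Content-Type: image/jpeg\r\n"
                 b"\r\n"
                 b"\xff\xd8fake-jpeg-bytes")   # made-up payload, not a real image
server_side.sendall(fake_response)
server_side.close()                             # signals "no more data"

raw = read_all(client_side)
client_side.close()

header, body = split_response(raw)
print(header.decode("ascii"))
print(len(body), "bytes of payload")
```

Using partition() here is a design choice: it does the same job as find() plus the "+ 4" arithmetic, and it makes the missing-separator case easy to detect.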

The example given is attempting to demonstrate low-level access to HTTP data without, apparently, giving you sufficient background about what is going on. A more normal way to access the data would use a higher-level library such as urllib. Here's some code that retrieves the image much more simply:

>>> import urllib
>>> response = urllib.urlopen("http://www.py4inf.com/cover.jpg")
>>> content = response.read()
>>> outf = open("cover.jpg", 'wb')
>>> outf.write(content)
>>> outf.close()

I could open the resulting JPEG file without any issues.
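The urllib.urlopen call above is the Python 2 spelling. On Python 3 the equivalent function lives in urllib.request; a rough sketch, assuming Python 3 and using a hypothetical helper name download, might look like this:

```python
import shutil
import urllib.request


def download(url, dest_path):
    """Stream the response body for `url` into `dest_path` in binary mode."""
    # urlopen() handles the HTTP headers itself, so only the payload
    # is copied to the output file.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as outf:
        shutil.copyfileobj(response, outf)
```

Calling download("http://www.py4inf.com/cover.jpg", "cover.jpg") would then mirror the session above, provided the URL is still reachable.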

EDIT 2020-10-09: A more up-to-date way of obtaining the same result is to use the requests module, with a context manager to ensure the output file is closed correctly.

>>> import requests
>>> response = requests.get("http://www.py4inf.com/cover.jpg")
>>> with open("result.jpg", "wb") as outf:
...     outf.write(response.content)
...
70057
>>>

Upvotes: 3

saulspatz

Reputation: 5261

  1. Your first question has been asked and answered several times on SO. The short answer is, "It's just a technicality; you don't really need to know."

  2. You are correct.

  3. The header ends with a blank line, that is, two consecutive CRLF sequences. If you save the file without discarding the header, it won't be in JPEG format, and you won't be able to use it. The header carries the information HTTP needs to deliver the file over the internet (status, content type and so on); it is not part of the image. You have to discard it and save only the payload.
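To see why the header has to go, note that a JPEG file begins with the two bytes FF D8 (the "start of image" marker), whereas a raw HTTP response begins with the status line. A small Python 3 sketch, with a made-up response and a hypothetical helper looks_like_jpeg:

```python
JPEG_SOI = b"\xff\xd8"   # every JPEG file starts with this marker


def looks_like_jpeg(data):
    """Cheap sanity check: does the data start with the JPEG marker?"""
    return data.startswith(JPEG_SOI)


# Invented response for illustration; the payload is not a real image.
raw_response = (b"HTTP/1.0 200 OK\r\n"
                b"Content-Type: image/jpeg\r\n"
                b"\r\n"
                b"\xff\xd8payload")
payload = raw_response.partition(b"\r\n\r\n")[2]

print(looks_like_jpeg(raw_response))  # False: starts with "HTTP/1.0"
print(looks_like_jpeg(payload))       # True: header has been discarded
```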

Upvotes: 1
