ytrp

Reputation: 291

How does one programmatically download files from the web?

How are files downloaded from servers in programming languages like C? I understand higher level languages have magic functions like "download_file_from_url()" but they don't help me understand what is actually going on. I'm a little familiar with sockets but network programming in general is still a black box to me. Thanks for any help.

Upvotes: 3

Views: 3840

Answers (6)

T.J. Crowder

Reputation: 1073978

Basically, at a low-ish level, the program is opening a socket to port 80 (usually) on the server and sending it a request that looks something like this:

GET /index.html HTTP/1.1
Host: stackoverflow.com

...followed by a blank line.
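A minimal sketch of building that request text in C. The helper name `build_get_request` is made up for illustration. Each line ends with CRLF, and the "blank line" is just one more CRLF on its own.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes a minimal HTTP/1.1 GET request into buf.
 * Returns the request length, or -1 if buf is too small. */
int build_get_request(char *buf, size_t size, const char *host, const char *path)
{
    int n = snprintf(buf, size,
                     "GET %s HTTP/1.1\r\n"   /* request line */
                     "Host: %s\r\n"          /* required header in HTTP/1.1 */
                     "\r\n",                 /* blank line ends the request */
                     path, host);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

Calling `build_get_request(buf, sizeof buf, "stackoverflow.com", "/index.html")` fills `buf` with the same bytes as the request shown above, ready to be written to the socket.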

The server then responds with the data, which typically consists of a status line and a few header lines, a blank line, and the requested resource. With HTTP/1.1 the default is to keep the connection alive for subsequent requests (although the server could terminate it if it liked); if I'd used HTTP/1.0 or added a Connection: close header, the server would break the connection after sending the resource.
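A rough sketch of picking such a response apart in C, assuming the whole response is already in memory as a NUL-terminated string. The helper names `parse_status_code` and `find_body` are invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extracts the numeric status code from a response
 * that starts with a status line such as "HTTP/1.1 200 OK".
 * Returns the code, or -1 if the status line is malformed. */
int parse_status_code(const char *response)
{
    int major, minor, code;

    if (sscanf(response, "HTTP/%d.%d %d", &major, &minor, &code) != 3)
        return -1;
    return code;
}

/* Hypothetical helper: returns a pointer to the first byte of the body,
 * i.e. just past the blank line (CRLF CRLF) that ends the headers,
 * or NULL if that blank line is not present. */
const char *find_body(const char *response)
{
    const char *sep = strstr(response, "\r\n\r\n");

    return sep ? sep + 4 : NULL;
}
```

A real client reads from the socket in pieces, so the blank line may not have arrived yet when you first look; `find_body` returning NULL is the cue to read more.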

Check out the Wikipedia article on HTTP for details, or if you really want to get into it, check out the spec (all-in-one-page here). You can see what this looks like for yourself if you have telnet (and you probably do). Just type telnet stackoverflow.com 80 and then type in the lines above. Remember to press Enter on the blank line.

You do not want to reinvent this wheel. Virtually all languages and environments have a library available to help you that deals with all of the intricacies. (For instance, try the example above with www.stackoverflow.com instead of stackoverflow.com in both places — you get back a "moved permanently" response because the SO team want SO to be at stackoverflow.com, not www.stackoverflow.com. There are also "moved temporarily" responses, etc., etc.)

Upvotes: 13

Yasir Arsanukayev

Reputation: 9676

If you are downloading a file over HTTP, read the RFC on HTTP (it covers things like how the body can be split into chunks); if you are using FTP, read the RFC on FTP (which commands are used, e.g. PWD, CWD, etc.). Both are higher-level protocols that run on top of sockets anyway.
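To make the "split by chunks" part concrete: with HTTP/1.1 chunked transfer encoding, each chunk is a hexadecimal size line, CRLF, that many bytes of data, and another CRLF, and a chunk of size 0 ends the body. Below is a minimal sketch of a decoder in C, assuming the whole encoded body is already in memory. The function name `decode_chunked` is made up, and chunk extensions and trailers are simply ignored.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decodes a complete chunked body held in memory.
 * in/in_len is the encoded body; out must hold at least in_len bytes.
 * Returns the decoded length, or -1 on malformed or truncated input. */
long decode_chunked(const char *in, size_t in_len, char *out)
{
    size_t pos = 0, out_len = 0;

    for (;;) {
        size_t eol = pos, size = 0, digits = 0, i;

        /* Find the CRLF that ends the chunk-size line. */
        while (eol + 1 < in_len && !(in[eol] == '\r' && in[eol + 1] == '\n'))
            eol++;
        if (eol + 1 >= in_len)
            return -1;                  /* size line never terminated */

        /* Parse the hexadecimal size; stop at ';' or anything non-hex. */
        for (i = pos; i < eol && isxdigit((unsigned char)in[i]); i++) {
            int c = tolower((unsigned char)in[i]);
            size = size * 16 + (size_t)(c <= '9' ? c - '0' : c - 'a' + 10);
            digits++;
        }
        if (digits == 0)
            return -1;

        pos = eol + 2;                  /* skip past the size line */
        if (size == 0)
            return (long)out_len;       /* last chunk; trailers ignored */
        if (size > in_len - pos || in_len - pos - size < 2)
            return -1;                  /* chunk data or its CRLF missing */

        memcpy(out + out_len, in + pos, size);
        out_len += size;
        pos += size;
        if (in[pos] != '\r' || in[pos + 1] != '\n')
            return -1;
        pos += 2;
    }
}
```

For example, the encoded body `4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n` decodes to the nine bytes `Wikipedia`.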

Upvotes: 1

DVK

Reputation: 129363

To download a file (assume a simple case - no firewall etc...), you need to:

  • Connect to a DNS server to resolve the name of the URL's server into an IP address

  • Open a connection to that IP on the URL's port or default port for your protocol (80 for http)

  • Send the appropriate HTTP command over to that server

  • Listen for HTTP response

  • Process the response correctly, and if the response contains the data for the file, keep reading the response and saving the data in a temp file

  • When file is fully downloaded, close the connection and move the complete temp file into proper location.
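The steps above can be sketched in C with POSIX sockets roughly as follows. This is a simplified illustration, not a complete client: it only accepts a plain 200 response, sends Connection: close so that end-of-file marks the end of the body, and does not handle redirects, chunked encoding, or HTTPS. The function names and the `.part` temp-file suffix are assumptions made up for this example.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Steps 1-2: resolve the host name (getaddrinfo performs the lookup)
 * and open a TCP connection. Returns a connected socket, or -1. */
int connect_to_host(const char *host, const char *port)
{
    struct addrinfo hints, *res, *p;
    int fd = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;       /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;   /* TCP */
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    for (p = res; p != NULL; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0)
            break;                     /* connected */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

/* Steps 4-6: read the response until the server closes the connection,
 * skip the headers, write the body to tmp_path, then rename it to
 * final_path. Returns 0 on success, -1 on failure. */
int save_response_body(int fd, const char *tmp_path, const char *final_path)
{
    char buf[8192], *body = NULL;
    size_t used = 0;
    ssize_t n;
    int code, failed;
    FILE *out;

    /* Collect bytes until the blank line that ends the headers. */
    while (body == NULL && used < sizeof buf - 1) {
        n = recv(fd, buf + used, sizeof buf - 1 - used, 0);
        if (n <= 0)
            return -1;                 /* closed or failed before headers ended */
        used += (size_t)n;
        buf[used] = '\0';
        body = strstr(buf, "\r\n\r\n");
    }
    if (body == NULL || sscanf(buf, "HTTP/%*d.%*d %d", &code) != 1 || code != 200)
        return -1;                     /* oversized headers or not a 200 */
    body += 4;

    out = fopen(tmp_path, "wb");
    if (out == NULL)
        return -1;
    fwrite(body, 1, used - (size_t)(body - buf), out);  /* body bytes already read */
    while ((n = recv(fd, buf, sizeof buf, 0)) > 0)
        fwrite(buf, 1, (size_t)n, out);
    failed = (n < 0) || ferror(out);
    if (fclose(out) != 0)
        failed = 1;
    if (failed || rename(tmp_path, final_path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}

/* Hypothetical driver: step 3 (send the request) plus the helpers above. */
int download(const char *host, const char *path, const char *final_path)
{
    char req[1024], tmp[1024];
    int fd, len, sent = 0, rc;
    ssize_t k;

    len = snprintf(req, sizeof req,
                   "GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n", path, host);
    rc = snprintf(tmp, sizeof tmp, "%s.part", final_path);
    if (len < 0 || (size_t)len >= sizeof req || rc < 0 || (size_t)rc >= sizeof tmp)
        return -1;
    if ((fd = connect_to_host(host, "80")) < 0)
        return -1;
    while (sent < len) {               /* send() may accept only part of the data */
        k = send(fd, req + sent, (size_t)(len - sent), 0);
        if (k <= 0) {
            close(fd);
            return -1;
        }
        sent += (int)k;
    }
    rc = save_response_body(fd, tmp, final_path);
    close(fd);
    return rc;
}
```

Something like `download("example.com", "/", "index.html")` would exercise all six steps. Writing to a temp file first means a failed or interrupted transfer never leaves a half-written file at the final location.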

Upvotes: 1

David Gelhar

Reputation: 27900

And a "black box" is probably a good way to keep it :-)

You do the same thing in C that you would do in "higher level languages" - use a library function that does it for you. (The difference is that the library function isn't a standard built-in part of the language).

One choice for C is libcurl.

Upvotes: 4

jkramer

Reputation: 15738

Use a library like libcurl.

Upvotes: 0

Hank Gay

Reputation: 71939

You should check out libcurl - it's open source so you can dig through it and see how a respected library approaches the problem.

Upvotes: 9
