samprat

Reputation: 2214

connect to website purely by raw socket connection

Language -> C++11 or C++98 (not C)
OS -> embedded Linux
Restriction -> no third-party libraries. Overview -> establish a connection to a website.
I have an embedded Linux system on which I am not allowed to install libraries such as Poco, libcurl, or Boost to connect to a website and extract information. Can someone show me how to establish a connection purely with raw sockets in C++ (not C) and retrieve information from a page?

Parsing the response and extracting the exact information is not a challenge for me; my main problem is how to establish the connection over the HTTP protocol. If I am right, to connect to a website I need the HTTP protocol rather than TCP/IP.
Could someone please point me in the right direction? Thanks.

Upvotes: 0

Views: 3670

Answers (1)

Programmer

Reputation: 125275

You can speak HTTP over a raw TCP socket; HTTP is just text sent across a TCP connection. Since you didn't provide code, I'll only outline the steps. If you already know how to connect, send, and receive data from a server, it should be easy. Let's assume you want to connect to www.cnn.com.

1. Convert the domain name of the website to an IP address (a DNS lookup, for example with getaddrinfo).

2. Connect to that IP address with port 80.

3. Send the string GET / HTTP/1.1\r\nHost: www.cnn.com\r\nConnection: close\r\n\r\n

4. Read from the socket. If the server is available, it will respond with a status line and headers, followed by the HTML of that page.

5. Close socket connection.
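The five steps above can be sketched with nothing but the POSIX socket API that ships with Linux. This is a minimal illustration rather than a finished client: the function names build_request and http_get are made up for this example, error handling is reduced to exceptions, and it assumes plain HTTP on port 80 with Connection: close so that the end of the response is simply the server closing the socket.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Step 3: the request text. Every header line ends with \r\n and an
// empty line (a lone \r\n) terminates the header section.
// (build_request is a hypothetical helper name.)
std::string build_request(const std::string& host, const std::string& path) {
    assert(!path.empty() && path[0] == '/');  // the path must start with '/'
    return "GET " + path + " HTTP/1.1\r\n"
           "Host: " + host + "\r\n"
           "Connection: close\r\n"
           "\r\n";
}

// Fetches http://<host><path> and returns the raw response
// (status line + headers + body). Hypothetical helper name.
std::string http_get(const std::string& host, const std::string& path) {
    // Step 1: resolve the domain name to one or more addresses.
    addrinfo hints;
    std::memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      // IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;  // TCP
    addrinfo* results = 0;
    int rc = getaddrinfo(host.c_str(), "80", &hints, &results);
    if (rc != 0) throw std::runtime_error(gai_strerror(rc));

    // Step 2: connect to the first address that accepts the connection.
    int fd = -1;
    for (addrinfo* ai = results; ai != 0; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd == -1) continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(results);
    if (fd == -1) throw std::runtime_error("could not connect to " + host);

    // Step 3: send the request; send() may accept only part of the buffer.
    const std::string request = build_request(host, path);
    std::size_t sent = 0;
    while (sent < request.size()) {
        // MSG_NOSIGNAL (Linux) avoids SIGPIPE if the peer has gone away.
        ssize_t n = send(fd, request.data() + sent, request.size() - sent,
                         MSG_NOSIGNAL);
        if (n <= 0) {
            close(fd);
            throw std::runtime_error("send failed");
        }
        sent += static_cast<std::size_t>(n);
    }

    // Step 4: read until the server closes the connection.
    std::string response;
    char buf[4096];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof buf, 0)) > 0) {
        response.append(buf, static_cast<std::size_t>(n));
    }

    // Step 5: close the socket.
    close(fd);
    return response;
}
```

Calling http_get("www.cnn.com", "/") would return the whole response as one string; the body starts after the first empty line (\r\n\r\n). A site that only serves HTTPS will typically answer with a redirect instead of the page.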

Note that some websites will not respond or will even block you if you don't provide the User-Agent/Web browser name you are using.

To fix this, in step 3, add a User-Agent: MyBrowserName\r\n header to the string. You can impersonate a real browser. Every header must end with \r\n.

For example, the User-Agent of the Chrome browser I am using is Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36.

Your new string that will be sent in Step 3 should look something like this GET / HTTP/1.1\r\nHost: www.cnn.com\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36\r\n\r\n. You should notice that there is \r\n after each header. The last header ends with \r\n\r\n instead of \r\n.
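Assembling the string by hand makes it easy to forget a \r\n, so one option is a small helper that appends the terminator after every header and the blank line at the end. The sketch below is only an illustration; HeaderList and build_get_request are names invented for this example, and it is written so that it also compiles as C++98.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Extra headers as (name, value) pairs, e.g. ("User-Agent", "MyBrowserName").
typedef std::vector<std::pair<std::string, std::string> > HeaderList;

// Builds a GET request: request line, Host, Connection: close, then any
// extra headers, each ending in \r\n, and finally the empty line that
// marks the end of the headers.
std::string build_get_request(const std::string& host,
                              const std::string& path,
                              const HeaderList& extra) {
    assert(!path.empty() && path[0] == '/');  // the path must start with '/'
    std::string req = "GET " + path + " HTTP/1.1\r\n";
    req += "Host: " + host + "\r\n";
    req += "Connection: close\r\n";
    for (std::size_t i = 0; i < extra.size(); ++i) {
        req += extra[i].first + ": " + extra[i].second + "\r\n";
    }
    req += "\r\n";
    return req;
}
```

To reproduce the string shown above, push a ("User-Agent", "Mozilla/5.0 ...") pair into a HeaderList and pass it together with "www.cnn.com" and "/".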

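Other useful headers are Connection: Keep-Alive\r\n, Accept-Language: en-us\r\n, and Accept-Encoding: gzip, deflate\r\n. Keep in mind that Connection: Keep-Alive replaces Connection: close, so the server will no longer signal the end of the response by closing the socket, and that with Accept-Encoding the server may send a compressed body that you have to decompress yourself.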

Replace port 80 with 443 if the website uses https instead of http. Things get complicated from here because you have to implement the SSL/TLS protocol yourself.

Suppose you want to access a page in another directory instead of the home page, and the URL is http://www.cnn.com/2016/05/13/health/healthy-eating-quiz/index.html

The string to send should look like this:

GET /2016/05/13/health/healthy-eating-quiz/index.html HTTP/1.1\r\nHost: www.cnn.com\r\nConnection: close\r\n\r\n

If you are using a proxy, you have to put the whole URL after the GET command:

GET http://www.cnn.com/2016/05/13/health/healthy-eating-quiz/index.html HTTP/1.1\r\nHost: www.cnn.com\r\nConnection: close\r\n\r\n
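The only difference between the direct and the proxy form is what follows GET: the path alone, or the whole URL. A rough sketch of deriving both from a URL is shown here; UrlParts, split_http_url, and request_for are hypothetical names, and the splitting is deliberately naive (it assumes an http:// URL without a port, user info, or fragment).

```cpp
#include <cassert>
#include <string>

// Host and path of a simple http:// URL (hypothetical helper type).
struct UrlParts {
    std::string host;
    std::string path;
};

// Splits "http://host/path" into host and path. An empty path becomes "/".
UrlParts split_http_url(const std::string& url) {
    const std::string scheme = "http://";
    std::string rest = url;
    if (rest.compare(0, scheme.size(), scheme) == 0) {
        rest.erase(0, scheme.size());
    }
    UrlParts parts;
    const std::string::size_type slash = rest.find('/');
    if (slash == std::string::npos) {
        parts.host = rest;
        parts.path = "/";
    } else {
        parts.host = rest.substr(0, slash);
        parts.path = rest.substr(slash);
    }
    return parts;
}

// Builds the request text. Through a proxy the request line carries the
// full URL; when talking to the web server directly it carries the path.
std::string request_for(const std::string& url, bool via_proxy) {
    const UrlParts p = split_http_url(url);
    assert(!p.path.empty() && p.path[0] == '/');
    const std::string target = via_proxy ? "http://" + p.host + p.path : p.path;
    return "GET " + target + " HTTP/1.1\r\n"
           "Host: " + p.host + "\r\n"
           "Connection: close\r\n"
           "\r\n";
}
```

In the proxy case the TCP connection from steps 1 and 2 goes to the proxy's address and port rather than to www.cnn.com; the Host header still names the website.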

Upvotes: 4
