Reputation:
I need to read all the HTML text from a URL such as http://localhost/index.html
into a string in C.
I know that if I run telnet www.google.com 80 and then send a GET request for the page,
it returns all the HTML.
How do I do this in a Linux environment with C?
Upvotes: 2
Views: 2674
Reputation: 202475
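Assuming you know how to read a file into a string, I'd try
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *read_file_into_string(FILE *fp); // you write this function

const char *url_contents(const char *url) {
    // create w3m command and pass it to popen()
    // (url is pasted into a shell command, so only pass URLs you trust)
    size_t bufsize = strlen(url) + 100;
    char *buf = malloc(bufsize);
    if (buf == NULL) return NULL;
    snprintf(buf, bufsize, "w3m -dump_source '%s'", url);
    // get a file handle, read all the html from it, close, and return
    FILE *html = popen(buf, "r");
    free(buf);
    if (html == NULL) return NULL;
    const char *s = read_file_into_string(html);
    pclose(html); // a stream from popen() is closed with pclose(), not fclose()
    return s;
}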
This forks a process, but it's a lot easier to let w3m
do the heavy lifting.
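The answer leaves read_file_into_string to the reader. A minimal sketch of one way to write it is below, assuming the caller wants a malloc'd, NUL-terminated string that it frees itself; the growth strategy (start at 4096 bytes, double when full) is my own choice, not something the answer specifies.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read everything remaining on fp into a freshly malloc'd, NUL-terminated
 * string. Returns NULL on allocation or read failure; the caller frees it. */
char *read_file_into_string(FILE *fp)
{
    assert(fp != NULL);
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Always keep one byte spare for the terminating NUL. */
        size_t n = fread(buf + len, 1, cap - len - 1, fp);
        len += n;
        if (n == 0)
            break;                      /* end of file or read error */
        if (cap - len - 1 == 0) {       /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

Because it only uses fread, the same function works on a stream returned by popen() or on an ordinary file opened with fopen().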
Upvotes: 0
Reputation: 9110
I would suggest using a couple of libraries, which are commonly available on most Linux distributions:
libcurl and libxml2
libcurl provides a comprehensive suite of HTTP features, and libxml2 provides a module for parsing HTML, called HTMLParser.
Hope that points you in the right direction.
Upvotes: 5
Reputation: 363
Below is a rough outline of code (i.e. not much error checking, and I haven't tried to compile it) to get you started, but use http://www.tenouk.com/cnlinuxsockettutorials.html to learn socket programming. Look up gethostbyname (or the newer getaddrinfo) if you need to translate a hostname (like google.com) into an IP address. You may also need to do some work to parse the content length out of the HTTP response and then keep calling recv until you've received all the bytes.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

void getWebpage(char *buffer, int bufsize, const char *ipaddress)
{
    int sockfd;
    struct sockaddr_in destAddr;

    if ((sockfd = socket(AF_INET, SOCK_STREAM, 0)) == -1) {
        fprintf(stderr, "Error opening client socket\n");
        return; // no socket was created, so there is nothing to close
    }

    memset(&destAddr, 0, sizeof(destAddr));
    destAddr.sin_family = AF_INET;
    destAddr.sin_port = htons(80); // HTTP port is 80
    destAddr.sin_addr.s_addr = inet_addr(ipaddress); // Get int representation of IP

    if (connect(sockfd, (struct sockaddr *)&destAddr, sizeof(destAddr)) == -1) {
        fprintf(stderr, "Error with client connecting to server\n");
        close(sockfd);
        return;
    }

    // Send http request; the blank line (\r\n\r\n) marks the end of the request
    const char *httprequest = "GET / HTTP/1.0\r\n\r\n";
    send(sockfd, httprequest, strlen(httprequest), 0);

    // A single recv() may return only part of the response (see the note above)
    ssize_t n = recv(sockfd, buffer, bufsize - 1, 0);
    buffer[n > 0 ? n : 0] = '\0'; // terminate so buffer can be used as a C string
    close(sockfd);

    // Now buffer has the HTTP response which includes the webpage. You can either
    // trim off the HTTP header, or just leave it in depending on what you are doing
    // with the page
}
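The two follow-up chores mentioned above (keep calling recv until everything has arrived, and trim off the HTTP header) could be sketched as follows. This is only one possible approach: rather than parsing Content-Length, it relies on the HTTP/1.0 convention that the server closes the connection once the response is complete. The names recv_all and http_body are hypothetical helpers, not part of any library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read from sockfd until the peer closes the connection. Returns a malloc'd,
 * NUL-terminated buffer (caller frees) and stores its length in *out_len,
 * or returns NULL on error. */
char *recv_all(int sockfd, size_t *out_len)
{
    assert(out_len != NULL);
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        if (cap - len < 2) {            /* no room left: double the buffer */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        ssize_t n = recv(sockfd, buf + len, cap - len - 1, 0);
        if (n < 0) {                    /* error (EINTR is not retried here) */
            free(buf);
            return NULL;
        }
        if (n == 0)                     /* peer closed: response is complete */
            break;
        len += (size_t)n;
    }
    buf[len] = '\0';
    *out_len = len;
    return buf;
}

/* Return a pointer to the body inside an HTTP response, i.e. the first byte
 * after the blank line that ends the headers, or NULL if there is none. */
const char *http_body(const char *response)
{
    const char *sep = strstr(response, "\r\n\r\n");
    return sep ? sep + 4 : NULL;
}
```

In getWebpage, the single recv call could be swapped for recv_all(sockfd, &len), with http_body applied to the result when only the HTML is wanted.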
Upvotes: 2
Reputation: 955
If you really don't feel like messing around with sockets, you could always create a named temp file, fork off a process and execvp() it to run wget -O with that file as the output, and then read the input from that temp file.
Although this would be a pretty lame and inefficient way to do things, it would mean you wouldn't have to mess with TCP and sending HTTP requests.
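A sketch of that temp-file approach, assuming wget is installed and on the PATH. The helper names run_command and fetch_with_wget are invented for this example; -q (quiet) and -O (output file) are standard wget options.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with the given arguments, wait for it to finish, and return
 * its exit status, or -1 if it could not be started or died from a signal. */
int run_command(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if execvp() failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Download url into a new temp file. path_template must be a writable array
 * ending in "XXXXXX"; mkstemp() overwrites it with the real file name.
 * Returns 0 on success, -1 on failure. */
int fetch_with_wget(const char *url, char *path_template)
{
    int fd = mkstemp(path_template);
    if (fd < 0)
        return -1;
    close(fd);                      /* wget reopens the file by name */
    char *argv[] = { "wget", "-q", "-O", path_template, (char *)url, NULL };
    return run_command(argv) == 0 ? 0 : -1;
}
```

A caller would declare something like char path[] = "/tmp/pageXXXXXX";, call fetch_with_wget(url, path), read the file at path into a string, and then unlink(path) to clean up.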
Upvotes: 1
Reputation: 2001
Use sockets: connect to the web server, request the page over HTTP (for "http://localhost/index.html", the host is localhost and the path is /index.html), and then parse the data you receive.
Helpful if you are a beginner in socket programming: http://beej.us/guide/bgnet/
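To make the "interrogate the web server with HTTP" step more concrete, here is a small sketch that splits a URL into host and path and builds the request text to send over the socket. Both function names are hypothetical, and the sketch deliberately handles only plain http:// URLs; an explicit port such as localhost:8080 would be left inside the host string and would need separating before connecting.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Split "http://host/path" into host (copied into the caller's buffer) and
 * path (a pointer into url, or "/" if the URL has no path).
 * Returns 0 on success, -1 if the URL is not http:// or host does not fit. */
int split_http_url(const char *url, char *host, size_t hostsize,
                   const char **path)
{
    assert(url != NULL && host != NULL && path != NULL);
    const char *prefix = "http://";
    size_t plen = strlen(prefix);
    if (strncmp(url, prefix, plen) != 0)
        return -1;

    const char *start = url + plen;
    const char *slash = strchr(start, '/');
    size_t hlen = slash ? (size_t)(slash - start) : strlen(start);
    if (hlen == 0 || hlen >= hostsize)
        return -1;

    memcpy(host, start, hlen);
    host[hlen] = '\0';
    *path = slash ? slash : "/";
    return 0;
}

/* Write a minimal HTTP/1.0 GET request into out. Returns the request length,
 * or -1 if it does not fit in outsize bytes. */
int build_get_request(char *out, size_t outsize,
                      const char *host, const char *path)
{
    int n = snprintf(out, outsize, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n",
                     path, host);
    return (n < 0 || (size_t)n >= outsize) ? -1 : n;
}
```

The host string is what you would resolve and connect to on port 80, and the request buffer is what you would pass to send().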
Upvotes: 0