Reputation: 2035
My application receives, as part of its data, a large HTML file that contains a large number of links. It looks like what you get when you search for anything on Google, Yahoo, or another search engine: a list of URLs, each with a description or other text.
I've been trying to come up with a function that can parse out each URL and its description and save them to a text file, but it has proven hard, at least for me. So, if I have:
<a href="http://www.w3schools.com">Visit W3Schools</a>
I would parse http://www.w3schools.com
and Visit W3Schools
and save them in a file.
Is there any way to achieve this in plain C?
Any help is appreciated.
Upvotes: 0
Views: 263
Reputation: 4314
You really need a proper HTML parser, but for something quick and dirty, try:
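#include <stdbool.h>
#include <string.h>

/* Finds the next <a ... href="...">...</a> in *data.
   On success, *url and *desc point into the (modified) input buffer
   and *data is advanced past the closing </a>. */
bool get_url(char **data, char **url, char **desc)
{
    bool result = false;
    char *ptr = strstr(*data, "<a");           /* start of an anchor tag */
    if(NULL != ptr)
    {
        *data = ptr + 2;
        ptr = strstr(*data, "href=\"");        /* start of the href attribute */
        if(NULL != ptr)
        {
            *data = ptr + 6;
            *url = *data;
            ptr = strchr(*data, '"');          /* closing quote of the URL */
            if(NULL != ptr)
            {
                *ptr = '\0';                   /* terminate the URL in place */
                *data = ptr + 1;
                ptr = strchr(*data, '>');      /* end of the opening tag */
                if(NULL != ptr)
                {
                    *data = ptr + 1;
                    *desc = *data;
                    ptr = strstr(*data, "</a>");
                    if(NULL != ptr)
                    {
                        *ptr = '\0';           /* terminate the description in place */
                        *data = ptr + 4;
                        result = true;
                    }
                }
            }
        }
    }
    return result;
}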
Note that data gets updated to point beyond the text already parsed (it's an in-out parameter), and that the string passed in gets modified: the URL and description are NUL-terminated in place, so the buffer must be writable. I'm feeling too lazy/busy to write a full solution that returns freshly allocated strings.
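For anyone who does want allocated return strings, here is a rough sketch of what that could look like. The names get_url_copy and dup_range are made up for illustration and are not part of the function above; it assumes C99 and that the caller frees both strings.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the bytes in [start, end) into a new
   NUL-terminated string. Returns NULL if malloc fails. */
static char *dup_range(const char *start, const char *end)
{
    size_t len = (size_t)(end - start);
    char *copy = malloc(len + 1);
    if (copy != NULL) {
        memcpy(copy, start, len);
        copy[len] = '\0';
    }
    return copy;
}

/* Hypothetical non-destructive variant of get_url: the input is never
   written to, and *url / *desc receive malloc'd copies that the caller
   must free. *data is advanced only on success. */
bool get_url_copy(const char **data, char **url, char **desc)
{
    const char *tag = strstr(*data, "<a");
    if (tag == NULL) return false;

    const char *url_start = strstr(tag + 2, "href=\"");
    if (url_start == NULL) return false;
    url_start += 6;                            /* skip past href=" */

    const char *url_end = strchr(url_start, '"');
    if (url_end == NULL) return false;

    const char *desc_start = strchr(url_end + 1, '>');
    if (desc_start == NULL) return false;
    desc_start += 1;                           /* skip past '>' */

    const char *desc_end = strstr(desc_start, "</a>");
    if (desc_end == NULL) return false;

    char *u = dup_range(url_start, url_end);
    char *d = dup_range(desc_start, desc_end);
    if (u == NULL || d == NULL) {              /* allocation failed */
        free(u);
        free(d);
        return false;
    }

    *url = u;
    *desc = d;
    *data = desc_end + 4;                      /* skip past </a> */
    return true;
}
```

The trade-off is one malloc per string, but the input can then live in read-only memory (a string literal or a read-only mapped file, say).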
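Also, you probably ought to return an error from each of the nested checks except the first: failing the first just means there are no more links, whereas failing any of the later ones means the markup is malformed. That is partly why I stacked the closing braces up like that. There are other, neater solutions that can be adapted to be more generic.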
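As a rough idea of what returning errors might look like, here is a flattened sketch with early returns. The enum link_status and the name next_link are invented for this example. Unlike the function above, it only writes to the buffer once the whole link has been found.

```c
#include <string.h>

/* Hypothetical result codes; the names are illustrative only. */
enum link_status { LINK_OK, LINK_NONE, LINK_MALFORMED };

/* Same in-place parsing idea as get_url, but a failed check after the
   first is reported as malformed input instead of a plain "false". */
enum link_status next_link(char **data, char **url, char **desc)
{
    char *ptr = strstr(*data, "<a");
    if (ptr == NULL) return LINK_NONE;          /* no more anchors: normal end */

    ptr = strstr(ptr + 2, "href=\"");
    if (ptr == NULL) return LINK_MALFORMED;     /* anchor without href=" */
    char *url_start = ptr + 6;

    char *url_end = strchr(url_start, '"');
    if (url_end == NULL) return LINK_MALFORMED; /* unterminated URL */

    char *tag_end = strchr(url_end + 1, '>');
    if (tag_end == NULL) return LINK_MALFORMED; /* opening tag never closes */

    char *desc_end = strstr(tag_end + 1, "</a>");
    if (desc_end == NULL) return LINK_MALFORMED; /* missing </a> */

    /* Everything was found, so it is now safe to modify the buffer. */
    *url_end = '\0';
    *desc_end = '\0';
    *url = url_start;
    *desc = tag_end + 1;
    *data = desc_end + 4;
    return LINK_OK;
}
```

A caller can then stop quietly on LINK_NONE and log or abort on LINK_MALFORMED.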
So basically you call the function repeatedly, passing the same data pointer each time, until it returns false.
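A minimal sketch of that loop, writing each URL and description to a file on separate lines. The name save_links and the output layout are my own assumptions, and get_url is restated here in condensed form only so the snippet compiles on its own (it does not advance data on failure, which makes no difference to the loop).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Condensed restatement of get_url with early returns. */
static bool get_url(char **data, char **url, char **desc)
{
    char *ptr = strstr(*data, "<a");
    if (ptr == NULL) return false;
    ptr = strstr(ptr + 2, "href=\"");
    if (ptr == NULL) return false;
    *url = ptr + 6;
    ptr = strchr(*url, '"');
    if (ptr == NULL) return false;
    *ptr = '\0';                    /* terminate the URL in place */
    ptr = strchr(ptr + 1, '>');
    if (ptr == NULL) return false;
    *desc = ptr + 1;
    ptr = strstr(*desc, "</a>");
    if (ptr == NULL) return false;
    *ptr = '\0';                    /* terminate the description in place */
    *data = ptr + 4;
    return true;
}

/* Hypothetical driver: writes "url\ndescription\n" for every link found
   in the writable buffer html and returns the number of links written. */
int save_links(char *html, FILE *out)
{
    char *url;
    char *desc;
    int count = 0;

    assert(html != NULL && out != NULL);
    while (get_url(&html, &url, &desc)) {
        fprintf(out, "%s\n%s\n", url, desc);
        count++;
    }
    return count;
}
```

To use it, read the whole HTML file into a writable, NUL-terminated buffer, open the output file with fopen in "w" mode, and pass both to save_links. A string literal will not do as input, because the parser writes into the buffer.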
Upvotes: 1