Jessica

Reputation: 2035

Count and parse all the href links out of an HTML file

Following my previous question, I have been trying to parse the href strings out of an HTML file so that I can pass each string to the solution from that question.

This is what I have, but it doesn't work:

void ParseUrls(char* Buffer)
{
    char *begin = Buffer;
    char *end = NULL;
    int total = 0;

    while(strstr(begin, "href=\"") != NULL)
    {   
        end = strstr(begin, "</a>");
        if(end != NULL)
        {
            char *url = (char*) malloc (1000 * sizeof(char));

            strncpy(url, begin, 100);
            printf("URL = %s\n", url);

            if(url) free(url);
        }

        total++;
        begin++;
    }

    printf("Total URLs = %d\n", total);
    return;
}

Basically, I need to extract the value of the href into a string, given something like:

<a href="http://www.w3schools.com">Visit W3Schools</a>

Any help is appreciated.

Upvotes: 1

Views: 1557

Answers (2)

The Archetypal Paul

Reputation: 41749

There are a lot of things wrong with this code.

  • You increment begin only by one each time around the loop. This means you find the same href over and over again. I think you meant to move begin to after end?

  • The strncpy will normally copy 100 characters (as the HTML will be longer) and so will not NUL-terminate the string. You want url[100] = '\0' somewhere.

  • Why do you allocate 1000 characters and use only 100?

  • You search for end starting from begin. This means that if there's a </a> before the href=" you'll find that instead.

  • You don't use end for anything.

  • Why don't you search for the terminating quote at the end of the URL?
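Putting the points in this list together, here is a minimal sketch of a scan loop that jumps past each match and stops at the closing quote. The function name count_hrefs is hypothetical, and the sketch assumes a NUL-terminated buffer and double-quoted attribute values. It matches href=" in any tag, not only anchors.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints every href="..." value found in html and
 * returns how many were found. Assumes html is NUL-terminated and that
 * attribute values are double-quoted. */
int count_hrefs(const char *html)
{
    static const char key[] = "href=\"";
    const size_t key_len = sizeof key - 1;
    const char *pos = html;
    int total = 0;

    while ((pos = strstr(pos, key)) != NULL) {
        const char *start = pos + key_len;     /* first character of the URL */
        const char *end = strchr(start, '"');  /* terminating quote */
        if (end == NULL)
            break;                             /* unterminated attribute: stop */

        /* %.*s prints exactly (end - start) characters, so no copy is needed */
        printf("URL = %.*s\n", (int)(end - start), start);
        total++;
        pos = end + 1;                         /* resume after this href */
    }
    return total;
}
```

Because the search resumes at end + 1 rather than begin + 1, each href is reported once.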

Given the above issues (and adding the termination of URL) it works OK for me.

Given

"<a href=\"/email_services.php\">Email services</a> "

it prints

URL = <a href="/email_services.php">Email services</a> 
URL = a href="/email_services.php">Email services</a> 
URL =  href="/email_services.php">Email services</a> 
URL = href="/email_services.php">Email services</a> 
Total URLs = 4

For the allocation of space, I think you should keep the result of the strstr of "href=\"" (call this start), and then the size you need is end - start (+1 for the terminating NUL). Allocate that much space, strncpy it across, add the NUL and Robert's your parent's male sibling.
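A minimal sketch of that allocation, under two assumptions of mine: start is moved past the href=" prefix so that only the URL is copied, and end is the closing quote of the attribute. The function name first_href is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of the first href value in
 * html, or NULL if there is none or allocation fails. The caller frees the
 * result. Assumes double-quoted attribute values. */
char *first_href(const char *html)
{
    static const char key[] = "href=\"";
    const char *start = strstr(html, key);
    if (start == NULL)
        return NULL;
    start += sizeof key - 1;                 /* skip past href=" */

    const char *end = strchr(start, '"');    /* closing quote */
    if (end == NULL)
        return NULL;

    size_t len = (size_t)(end - start);
    char *url = malloc(len + 1);             /* +1 for the terminating NUL */
    if (url == NULL)
        return NULL;

    strncpy(url, start, len);                /* copies len chars, adds no NUL here */
    url[len] = '\0';
    return url;
}
```

For the test string used above, this should yield /email_services.php rather than the whole tag.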

Also, remember href= isn't unique to anchors. It can appear in some other tags too.

Upvotes: 1

Steve Townsend

Reputation: 54148

This does not really answer your question about this code, but it would probably be more reliable to use a C library to do this, such as HTMLParser from libxml2.

HTML parsing looks easy, but there are enough edge cases that it is easier to use something known to work than to work through them all yourself.

Upvotes: 0
