ShaMora

Reputation: 67

Split HTML by tags

I want to split an HTML page into pieces at a tag delimiter such as <img or <div>. I tried the following code, but it doesn't work:

char source[MAXBUFLEN + 1];
FILE *fp = fopen("source.html", "r");
if (fp != NULL)
{
    size_t newLen = fread(source, sizeof(char), MAXBUFLEN, fp);
    if (newLen == 0) {
        fputs("Error reading file", stderr);
    } else {
        source[newLen] = '\0'; /* Terminate right after the last byte read. */
    }
    fclose(fp); /* Only close a stream that was actually opened. */
}

//not working
char* strArray[10];
int i = 0;
char *token = strtok(source, "<img");
while(token != NULL)
{
    strcpy(strArray[i++], token);

    token = strtok(NULL, "<img");
}

printf("%s\n", strArray[3]);

What am I doing wrong? Is there any other method I can use except strtok?

Upvotes: 0

Views: 319

Answers (3)

BLUEPIXY

Reputation: 40145

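#include <string.h>

/* Like strtok_r(), but splits on the whole string `word` instead of on a set
   of characters. Unlike strtok(), empty pieces are returned as "". */
char *strtokByWord_r(char *str, const char *word, char **store){
    char *p, *ret;
    if(str != NULL){
        *store = str;               /* first call: start scanning at str */
    }
    if(*store == NULL) return NULL; /* the previous call returned the last piece */
    p = strstr(ret=*store, word);   /* look for the next occurrence of word */
    if(p){
        *p='\0';                    /* terminate the current piece at the delimiter */
        *store = p + strlen(word);  /* the next call resumes right after the delimiter */
    } else {
        *store = NULL;              /* no more delimiters: this is the last piece */
    }
    return ret;
}
/* Non-reentrant wrapper that keeps the scan position in a static variable, as strtok() does. */
char *strtokByWord(char *str, const char *word){
    static char *store = NULL;
    return strtokByWord_r(str, word, &store);
}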

Replace

char *token = strtok(source, "<img");
...
token = strtok(NULL, "<img");

with

char *token = strtokByWord(source, "<img");
...
token = strtokByWord(NULL, "<img");

Upvotes: 2

Ingo Leonhardt

Reputation: 9894

As Daren has already posted, strtok() doesn't do what you want. You can use

char *ptr = strstr( source, "<img" );

instead to find the first tag, and then

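ptr = strstr(ptr+4, "<img" ); // search starts directly behind the previous "<img" (4 characters)
                              // maybe you can find a better offset

for the next occurrences. Check ptr against NULL before each call, since strstr() returns NULL once there is no further match.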

Besides, your line

strcpy(strArray[i++], token);

is undefined behavior and would most likely crash, because the pointers in strArray are uninitialized: no memory has been allocated for strcpy() to write to.
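
To make both points concrete, here is a minimal sketch that cuts the buffer at every "<img" with strstr() and stores pointers into the buffer itself, so nothing has to be allocated or copied. The helper name split_by_word and its parameters are made up for illustration, and it assumes the buffer is writable and NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a library function): cuts `buf` in place at every
 * occurrence of `word` and stores a pointer to the start of each piece in
 * `pieces`. Returns the number of pieces stored, at most `max`. */
size_t split_by_word(char *buf, const char *word, char **pieces, size_t max)
{
    assert(buf != NULL && word != NULL && *word != '\0');

    size_t wlen = strlen(word);
    size_t n = 0;
    char *start = buf;

    while (n < max) {
        pieces[n++] = start;              /* a piece begins here */
        char *hit = strstr(start, word);  /* next delimiter, or NULL */
        if (hit == NULL)
            break;                        /* last piece runs to the end of buf */
        *hit = '\0';                      /* terminate the current piece */
        start = hit + wlen;               /* resume right after the delimiter */
    }
    return n;
}
```

With char *pieces[10]; you would call size_t n = split_by_word(source, "<img", pieces, 10); and every pieces[k] with k < n is then a valid string. The pieces point into source, so they stay valid only as long as source does, and anything beyond the first max pieces is dropped.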

Upvotes: 2

Daren Thomas

Reputation: 70324

The second argument to strtok is a set of delimiter characters, not a delimiter string. Each of these characters on its own splits the string into tokens, so "<img" means "split at '<', 'i', 'm' or 'g'". I don't think it does what you think it does...
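
A small sketch to see this for yourself. collect_tokens is a hypothetical helper written for this answer, not a library function; it only wraps the usual strtok loop and records the tokens.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: tokenizes `buf` in place with strtok() and stores up
 * to `max` token pointers in `out`. Returns the number of tokens stored. */
size_t collect_tokens(char *buf, const char *delims, char **out, size_t max)
{
    size_t n = 0;
    for (char *tok = strtok(buf, delims);
         tok != NULL && n < max;
         tok = strtok(NULL, delims)) {
        out[n++] = tok;  /* each token points into buf */
    }
    return n;
}
```

Called on a writable copy of big <img src="x"> gem with the delimiters "<img", this yields the four tokens "b", " ", " src=\"x\"> " and "e": the ordinary words "big" and "gem" are chopped up as well, because 'i', 'm' and 'g' are delimiters wherever they occur.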

If you want to parse an HTML file into tokens, you could look into a lexer generator such as lex...

What is your desired output? Do you have a test case for your input?

Because each character of "<img" is treated as a separate delimiter, your code should produce the following:

input:

<html><img src="test.png"/></html>

output:

  • "ht"
  • "l>"
  • " src=\"test.pn"
  • "\"/>"
  • "/ht"
  • "l>"

I somehow don't think that is what you want...

Upvotes: 0
