Tokenize a line without strtok

Question

I'm reading lines from a file and tokenizing them. Tokens are either separated by one or more spaces or enclosed in quotes (example: "to ken").

I wrote some code, but I have a problem with pointers. I don't know how to store the tokens from a line, or rather, how to set pointers to them.

It was also suggested that I put a 0 after every token I "recognize", so I'll know where it ends, and that I store in char *tokens[] only pointers to the start of each token.

My current code:

char *tokens[50];
int token_count;

int tokenize(char *line){
    token_count = 0;
    int n = 0;          

    while(line[n] != NULL || line[n] != '\n'){
        while(isspace(line[n++]));
        if(line[n] == '"'){
            while(line[++n] != '"' || line[n] != NULL){
                  /* set tokens[n] */
            }
        }
        else{
            while(!isspace(line[n++])){
                  /*set tokens[n] */
            }

        }

        n++;
    }

    tokens[token_count] = 0;

}

M Oehm · Accepted Answer

You use the string base line and the index n to step through the string by incrementing n:

while (str[n] != '\0') n++;

Your task might be easier if you used pointers:

while (*str != '\0') str++;

Your tokens can then be expressed by the value of the pointer before reading the token, i.e. when you hit a quotation mark or a non-space. That gives you the start of the token.

What about the length of the token? In C, strings are arrays of chars, terminated by a null char. That means that a pointer to the start of a token, read as a C string, covers the rest of the whole line and therefore all subsequent tokens. You could place a '\0' after each token, but this has two drawbacks: it doesn't work on read-only string literals and, depending on your token syntax, it is not always possible. For example, the string a"b b"c should probably parse as the three tokens a, "b b" and c, but there is no room between them for a null char, so writing one would overwrite the first character of the next token and break the tokenising process.
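For comparison, here is a minimal sketch of the in-place approach that was suggested to the asker. The function name tokenize_in_place and its parameters are made up for illustration. It assumes the line lives in a writable buffer and that a quoted token only ever starts at a token boundary, so it does not handle input such as a"b b"c.

```c
#include <ctype.h>

/* Hypothetical helper: splits `line` in place by overwriting the character
   after each token with '\0' and stores token starts in tokens[].
   Returns the number of tokens found, or -1 if a quote is not closed. */
int tokenize_in_place(char *line, char *tokens[], int max)
{
    char *p = line;
    char *start;
    int count = 0;

    while (*p) {
        while (isspace((unsigned char)*p)) p++;
        if (*p == '\0') break;

        if (*p == '"') {
            start = ++p;                    /* token begins after the quote */
            while (*p && *p != '"') p++;
            if (*p == '\0') return -1;      /* quote not closed */
            *p++ = '\0';                    /* closing quote becomes the terminator */
        } else {
            start = p;
            while (*p && !isspace((unsigned char)*p)) p++;
            if (*p) *p++ = '\0';            /* separator becomes the terminator */
        }

        if (count < max) tokens[count] = start;   /* never write past tokens[] */
        count++;
    }

    return count;
}
```

Called on a writable array such as char line[] = "The \"New York\" Stock Exchange\n", this yields four null-terminated tokens: The, New York, Stock and Exchange. Passing a string literal instead would be undefined behaviour, which is the first drawback mentioned above.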

An alternative is to store tokens as pairs of pointer to starting char and length. These tokens are no longer null-terminated, so you will have to write them to a temporary buffer if you want to use them with the standard C string functions.

Here's a way to do that.

#include <ctype.h>
#include <stdio.h>

struct token {
    const char *str;
    int length;
};

int tokenize(const char *p, struct token tk[], int n)
{
    const char *start;
    int count = 0;   

    while (*p) {
        while (isspace((unsigned char)*p)) p++;   /* cast: isspace needs a non-negative value */
        if (*p == '\0') break;

        start = p;
        if (*p == '"') {
            p++;
            while (*p && *p != '"') p++;
            if (*p == '\0') return -1;        /* quote not closed */            
            p++;
        } else {            
            while (*p && !isspace((unsigned char)*p) && *p != '"') p++;
        }

        if (count < n) {
            tk[count].str = start;
            tk[count].length = (int)(p - start);
        }
        count++;
    }

    return count;
}

void token_print(const struct token tk[], int n)
{
    int i;

    for (i = 0; i < n; i++) {
        printf("[%d] '%.*s'\n", i, tk[i].length, tk[i].str);
    }
}

#define MAX_TOKEN 10

int main()
{
    const char *line = "The \"New York\" Stock Exchange";
    struct token tk[MAX_TOKEN];
    int n;

    n = tokenize(line, tk, MAX_TOKEN);
    if (n > MAX_TOKEN) n = MAX_TOKEN;
    token_print(tk, n);    

    return 0;
}

The start of each token is saved in a local variable and assigned to the token after it has been scanned. When p points to the character after the token, the expression:

p - start

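gives you the length. (This is called pointer arithmetic.) The routine scans all tokens, but it assigns at most n of them so as not to overflow the provided buffer. The return value is the total number of tokens found, which may be larger than n, or -1 if a quote is not closed.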
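Because these tokens are not null-terminated, they need to be copied before they can be passed to functions such as strcmp. The following is one possible sketch of that copy step. The helper name token_copy is an assumption for illustration, and the struct is repeated only so the snippet stands on its own.

```c
#include <string.h>

/* Same layout as the token struct used in the answer above. */
struct token {
    const char *str;
    int length;
};

/* Hypothetical helper: copies the token into buf as a null-terminated
   string, truncating it if buf is too small. Returns the number of
   characters copied, or -1 if buf cannot hold even the terminator. */
int token_copy(const struct token *tk, char *buf, size_t size)
{
    size_t len;

    if (size == 0 || tk->length < 0) return -1;

    len = (size_t)tk->length;
    if (len > size - 1) len = size - 1;   /* leave room for '\0' */

    memcpy(buf, tk->str, len);
    buf[len] = '\0';

    return (int)len;
}
```

With this tokenizer a quoted token still includes its quotation marks, so copying the second token of the example line gives "New York" with the quotes. Stripping them would be a separate step.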
