Kevin

Reputation: 1151

Strange assignment in implementing strtok

I am studying the implementation of strtok and have a question. On the line s[-1] = 0;, I don't understand how tok is limited to the first token, since we had previously assigned it everything contained in s.

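#include <stddef.h>  /* NULL */

/* Declared up front because strtok calls it before its definition below. */
char *strtok_r(char *s, const char *delim, char **last);

char *strtok(char *s, const char *delim)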
{
    static char *last;

    return strtok_r(s, delim, &last);
}

char *strtok_r(char *s, const char *delim, char **last)
{
    char *spanp;
    int c, sc;
    char *tok;

    if (s == NULL && (s = *last) == NULL)
        return (NULL);

    tok = s;
    for (;;) {
        c = *s++;
        spanp = (char *)delim;
        do {
            if ((sc = *spanp++) == c) {
                if (c == 0)
                    s = NULL;
                else
                    s[-1] = 0;
                *last = s;
                return (tok);
            }
        } while (sc != 0);
    }
}

Upvotes: 3

Views: 350

Answers (3)

David C. Rankin

Reputation: 84579

This took much more space than I anticipated when I started, but I think it offers a useful explanation alongside the others. (It became more of a mission, really.)

NOTE: This pairing of strtok and strtok_r provides both interfaces from one implementation. strtok_r is the reentrant one, because the caller supplies the save pointer (last). strtok is a thin wrapper that keeps that pointer in a static variable, so strtok itself is not reentrant. (I did not test reentrancy.)

The easiest way to understand this code (at least for me) is to understand what strtok and strtok_r do with the string they are operating on. Here strtok_r is where the work is done. strtok_r basically assigns a pointer to the string provided as an argument and then 'inch-worms' down the string, character-by-character, comparing each character to a delimiter character or null terminating character.

The key is to understand that the job of strtok_r is to chop the string up into separate tokens, which are returned on successive calls to the function. How does it work? The string is broken up into separate tokens by replacing each delimiter character found in the original string with a null-terminating character and returning a pointer to the beginning of the token (which will be either the start of the string on the first call, or the character after the last delimiter on successive calls).

As with the string.h strtok function, the first call to strtok takes the original string as the first argument. For successive parsing of the same string NULL is used as the first argument. The original string is left littered with null-terminating characters after calls to strtok, so make a copy if you need it further. Below is an explanation of what goes on in strtok_r as you inch-worm down the string.
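To make the calling convention concrete, here is a minimal sketch using the standard strtok from string.h. The helper name count_tokens is my own invention, not part of any library. It works on a private copy because strtok overwrites delimiters in the buffer it is given.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the tokens in text without modifying it.
 * Returns 0 if the copy cannot be allocated (a simplification). */
size_t count_tokens(const char *text, const char *delims)
{
    assert(text != NULL && delims != NULL);

    /* strtok writes null bytes into its input, so work on a copy. */
    size_t len = strlen(text) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return 0;
    memcpy(copy, text, len);

    size_t count = 0;
    /* First call passes the string; later calls pass NULL to continue. */
    for (char *tok = strtok(copy, delims); tok != NULL;
         tok = strtok(NULL, delims))
        count++;

    free(copy);
    return count;
}
```

Calling count_tokens("this is a test", " ") leaves the caller's string untouched, since only the copy gets chopped up.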

Consider for example the following string and strtok_r:

'this is a test'

The outer for loop stepping through string s

Ignoring the declarations and the NULL tests, the function assigns tok a pointer to the beginning of the string (tok = s). It then enters the for loop, where it steps through string s one character at a time. c is assigned the (int value of the) current character pointed to by s, and the pointer s is incremented to the next character (this is the for loop's increment of s). spanp is assigned a pointer to the delimiter array.

The inner do loop stepping through the delimiters 'delim'

The do loop is entered and then, using the spanp pointer, proceeds to go through the delim array testing if sc (the spanp character) equals the current for loop character c. If and only if our character c matches a delimiter, we then encounter the confusing if (c == 0) if-then-else test.
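The inner loop can be pulled out on its own to see what it decides. This is only a sketch: matches_delim_or_nul is a name I made up, and the body mirrors the do loop from the question. Note that because the loop also compares against the terminating null of delim, a c of 0 always counts as a match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the question's inner do loop.
 * Returns 1 if c equals any character of delim, including delim's
 * terminating null byte; returns 0 otherwise. */
int matches_delim_or_nul(int c, const char *delim)
{
    assert(delim != NULL);

    const char *spanp = delim;
    int sc;

    do {
        /* Read the next delimiter, step past it, then compare. */
        if ((sc = *spanp++) == c)
            return 1;
    } while (sc != 0);   /* stop after the terminator has been compared */

    return 0;
}
```

With delim set to " ,", a space or comma matches, 'x' does not, and 0 matches because the terminator is compared last.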

The if (c == 0) if-then-else test

This test is actually simple to understand when you think about it. We are crawling down string s, checking each character against the delim array. If we match one of the delimiters or hit the end, then what? We are about to return from the function, so what must we do?

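Here we ask: did we reach the normal end of the string (c == 0)? That case also counts as a match, because the inner loop compares c against the terminating null of delim as well. If so, we set s = NULL; otherwise we matched a delimiter but are not at the end of the string.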

Here is where the magic happens. We need to replace the delimiter character in the string with a null-terminating character (either 0 or '\0'). Why not simply write through s (*s = 0) here? Answer: we can't. We incremented it when assigning c = *s++; at the beginning of the for loop, so s now points to the next character in the string rather than to the delimiter. So, in order to replace the delimiter in string s with a null-terminating character, we must do s[-1] = 0;. This is where the string s gets chopped into a token. *last is assigned the current value of s (the start of the next scan), and tok (pointing to the original beginning of s) is returned by the function.
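The post-increment followed by s[-1] can be tried in isolation. A minimal sketch, assuming a single delimiter character for simplicity; cut_at_first is a hypothetical name, not something from the question's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: terminates s at the first occurrence of delim
 * and returns a pointer to the character after it, or NULL if the end
 * of the string is reached first. */
char *cut_at_first(char *s, char delim)
{
    assert(s != NULL);

    for (;;) {
        char c = *s++;       /* read the character, then step past it */
        if (c == '\0')
            return NULL;     /* hit the end: nothing follows */
        if (c == delim) {
            s[-1] = '\0';    /* s is one past the delimiter, so back up */
            return s;        /* where the next scan would begin */
        }
    }
}
```

Given char buf[] = "this is a test";, calling cut_at_first(buf, ' ') turns buf into the string "this" and returns buf + 5, which is what the question's code stores in *last.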

So, in the main program, you now have the return value of strtok_r: a pointer to the first character of the string s you passed in. That string is now null-terminated at the first occurrence of a character from delim, which gives you the token you asked for.

Upvotes: 1

David K

Reputation: 3132

There are two ways to reach the statement return(tok);. One way is that at the point where tok = s; occurs, s contains none of the delimiter characters (contents of delim). That means s is a single token. The for loop ends when c == 0, that is, at the null byte at the end of s, and strtok_r returns tok (that is, the entire string that was in s at the time of tok = s;), as it should.

The other way for that return statement to occur is when s contains some character that is in delim. In that case, at some point *spanp == c will be true where *spanp is not the terminating null of delim, and therefore c == 0 is false. At this point, s points to the character after the one from which c was read, and s - 1 points to the place where the delimiter was found. The statement s[-1] = 0; overwrites the delimiter with a null character, so now tok points to a string of characters that starts where tok = s; said to start, and ends at the first delimiter that was found in that string. In other words, tok now points to the first token in that string, no more and no less, and it is correctly returned by the function.
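Both paths can be seen side by side in a compact sketch. To be clear, this swaps in the standard strcspn for the two hand-written loops, so it is not the question's code; tok_once and its rest parameter are names I made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical one-shot tokenizer with the same two exits as the
 * question's strtok_r, written with strcspn instead of nested loops. */
char *tok_once(char *s, const char *delim, char **rest)
{
    assert(s != NULL && delim != NULL && rest != NULL);

    /* Number of leading characters of s that are not in delim. */
    size_t n = strcspn(s, delim);

    if (s[n] == '\0') {
        *rest = NULL;        /* path 1: no delimiter, s is one token */
    } else {
        s[n] = '\0';         /* path 2: overwrite the delimiter */
        *rest = s + n + 1;   /* next scan starts just after it */
    }
    return s;
}
```

For "single" with delim " ", the whole string comes back and *rest is NULL. For "this is", the buffer is cut down to "this" and *rest points at "is".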

The code is not very well self-documenting in my opinion, so it is understandable that it is confusing.

Upvotes: 0

ooga

Reputation: 15501

tok was not previously assigned "everything contained in s". It was set to point to the same address as the address in s.

The s[-1] = 0; line is equivalent to *(s - 1) = '\0';, which sets the location just before where s is pointing to zero.

Once that location is set to zero, the returned value of tok points to a string whose data spans from tok to s - 2 and is properly null-terminated at s - 1.
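A small sketch of that aliasing, assuming the delimiter's index is already known. The helper name visible_length_after_cut is hypothetical. tok and s are two pointers into one buffer, so a write through s changes what tok "sees".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: places a null byte at buf[delim_index] the same
 * way the question's code does, then reports the length seen via tok. */
size_t visible_length_after_cut(char *buf, size_t delim_index)
{
    assert(buf != NULL && delim_index < strlen(buf));

    char *tok = buf;                  /* same address as buf, not a copy */
    char *s = buf + delim_index + 1;  /* one past the delimiter */

    *(s - 1) = '\0';                  /* identical to s[-1] = 0 */

    return strlen(tok);               /* data now spans tok .. s - 2 */
}
```

With char buf[] = "this is a test"; and a delimiter index of 4, the result is 4, and the bytes from buf + 5 onward are still intact for the next call.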

Also note that before tok is returned, *last is set to the current value of s, which is the starting scan position for the next token. strtok saves this value in a static variable so it can be remembered and automatically used for the next token.
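To see why the save pointer matters, here is a sketch that tokenizes two buffers in alternation, each with its own save pointer. tok_r is the question's strtok_r logic copied under a different (hypothetical) name so it does not clash with the library, and interleave is likewise my own illustrative helper.

```c
#include <assert.h>
#include <stddef.h>

/* The question's strtok_r logic under a hypothetical name. Like the
 * posted code, it does not skip leading or repeated delimiters. */
char *tok_r(char *s, const char *delim, char **last)
{
    const char *spanp;
    int c, sc;
    char *tok;

    if (s == NULL && (s = *last) == NULL)
        return NULL;

    tok = s;
    for (;;) {
        c = *s++;
        spanp = delim;
        do {
            if ((sc = *spanp++) == c) {
                if (c == 0)
                    s = NULL;      /* end of input: no next token */
                else
                    s[-1] = 0;     /* cut the token at the delimiter */
                *last = s;         /* remember where to resume */
                return tok;
            }
        } while (sc != 0);
    }
}

/* Hypothetical helper: alternates tokens from a and b into out,
 * storing at most max pointers. Returns how many were stored. */
size_t interleave(char *a, char *b, const char *delim,
                  char **out, size_t max)
{
    assert(a != NULL && b != NULL && delim != NULL && out != NULL);

    char *save_a = NULL, *save_b = NULL;   /* one save pointer each */
    char *ta = tok_r(a, delim, &save_a);
    char *tb = tok_r(b, delim, &save_b);
    size_t n = 0;

    while ((ta != NULL || tb != NULL) && n < max) {
        if (ta != NULL) {
            out[n++] = ta;
            ta = tok_r(NULL, delim, &save_a);
        }
        if (tb != NULL && n < max) {
            out[n++] = tb;
            tb = tok_r(NULL, delim, &save_b);
        }
    }
    return n;
}
```

The same alternation through the strtok wrapper would not work, because its single static pointer would be overwritten each time the other string was started.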

Upvotes: 2
