Chris
Chris

Reputation: 8432

Problem with string overflow with strtok

I have a file of data:

C0001|H|Espresso Classics|The traditional espresso favourites.
C0002|H|Espresso Espresions|Delicious blend of espresso, milk, and luscious flavours.
C0003|H|Tea & Cocoa|Gloria Jean's uses only the finest cocoas and single garden teas. Always satisfying.
C0004|C|Iced Chocolate|Gloria Jean's version of a traditional favourite.
C0005|C|Mocha Chillers|An icy blend of chocolate, coffee, milk and delicious mix-ins.
C0006|C|Espresso Chillers|A creamy blend of fresh espresso, chocolate, milk, ice, and flavours.
C0007|C|On Ice|Cool refreshing Gloria Jean's creations over ice.

and the following code to tokenize it:

    #define MAX_CAT_TOK 4
    #define DATA_DELIM "|"
    char *token[100];

    for(i = 0; i < MAX_CAT_TOK; i++)
    {
        if(i == 0) token[i] = strtok(input, DATA_DELIM);
         else token[i] = strtok(NULL, DATA_DELIM);
         printf("%s\n", token[i]);
    }

The problem is that once a string that follows a longer string is printed, data from the longer string is printed out at the end of the shorter string. I'm assuming this has something to do with string termination??

Does anybody see something wrong I am doing here?

Upvotes: 2

Views: 877

Answers (3)

Mark Wilkins
Mark Wilkins

Reputation: 41242

It sounds like what is happening is that your buffer input is not being null terminated correctly. If, perhaps, it is initially all zeros, then the first line that is processed would be fine. If a longer input is stored in it, then it would still be fine. But then when an entry is stored in it that is shorter than the previous (e.g., the 4th line in your example), it could result in the problem if it was not null terminated.

If, for example, the new data was copied in via memcpy and did not include a null termination character, then the tokenization of the 4th item in the line would include that previous data.

If this is the case, then the solution is to make sure input is null terminated correctly.

The following attempts to show what I am trying to say:

strcpy( input, "a|b|c|some long data" );
tokenize( input );   // where tokenize is the logic shown in the OP calling strtok
// note the use of memcpy here rather than strcpy to show the idea
// and also note that it copies exactly 11 characters (doesn't include the null)
memcpy( input, "1|2|3|short", 11 ); 
tokenize( input );

In the above contrived example, the 4th item in the second tokenizing would be: shortlong data.

Edit In other words, the problem does not appear to be in the code shown in the OP. The problem lies in how `input is filled. You can probably see that it is not properly null terminated if you add a printf prior to your for loop that shows the actual data being parsed. The 4th line will likely show that it contains remnants of the previous line:

printf( "%s\n", input );

Upvotes: 3

Jonathan Leffler
Jonathan Leffler

Reputation: 754450

Here's my working code:

#include <string.h>
#include <stdio.h>

//#define DATA_DELIM "|"
#define DATA_DELIM "|\n"

int main(void)
{
    enum { LINE_LENGTH = 4096 };
    char input[LINE_LENGTH];
#define MAX_CAT_TOK 4
    char *token[100];

    while (fgets(input, sizeof(input), stdin) != 0)
    {
        printf("Input: %s", input);
        for (int i = 0; i < MAX_CAT_TOK; i++)
        {
            if (i == 0)
                token[i] = strtok(input, DATA_DELIM);
            else
                token[i] = strtok(NULL, DATA_DELIM);
            printf("%d: %s\n", i, token[i] != 0 ? token[i] : "<<NULL POINTER>>");
        }
    }
    return 0;
}

On the given data, I get:

Input: C0001|H|Espresso Classics|The traditional espresso favourites.
0: C0001
1: H
2: Espresso Classics
3: The traditional espresso favourites.
Input: C0002|H|Espresso Espresions|Delicious blend of espresso, milk, and luscious flavours.
0: C0002
1: H
2: Espresso Espresions
3: Delicious blend of espresso, milk, and luscious flavours.
Input: C0003|H|Tea & Cocoa|Gloria Jean's uses only the finest cocoas and single garden teas. Always satisfying.
0: C0003
1: H
2: Tea & Cocoa
3: Gloria Jean's uses only the finest cocoas and single garden teas. Always satisfying.
Input: C0004|C|Iced Chocolate|Gloria Jean's version of a traditional favourite.
0: C0004
1: C
2: Iced Chocolate
3: Gloria Jean's version of a traditional favourite.
Input: C0005|C|Mocha Chillers|An icy blend of chocolate, coffee, milk and delicious mix-ins.
0: C0005
1: C
2: Mocha Chillers
3: An icy blend of chocolate, coffee, milk and delicious mix-ins.
Input: C0006|C|Espresso Chillers|A creamy blend of fresh espresso, chocolate, milk, ice, and flavours.
0: C0006
1: C
2: Espresso Chillers
3: A creamy blend of fresh espresso, chocolate, milk, ice, and flavours.
Input: C0007|C|On Ice|Cool refreshing Gloria Jean's creations over ice.
0: C0007
1: C
2: On Ice
3: Cool refreshing Gloria Jean's creations over ice.

With the single-character delimiter string, I get an extra newline after each of the lines numbered 3.

This looks plausibly like what you wanted. So, either there's a problem with your input (did you echo it as you read it), or you've managed to find a flaky implementation of strtok(), or perhaps you are on Windows and the data lines have carriage returns as well as newlines, and you are seeing misleading output because of stray carriage returns.

Of these, I suspect the last (Windows and stray carriage returns) is the most likely - though I was not able to reproduce the problem even with a DOS-formatted data file (testing on MacOS X 10.6.7 with GCC 4.6.0).

Upvotes: 0

pmg
pmg

Reputation: 108978

I can't see anything wrong.

I've taken the liberty to make a compilable version of your code and put it at ideone. Compare with your version ...

#include <stdio.h>
#include <string.h>

int main(void) {
  int i, j;
  char *token[100];
  char *input;
  char inputs[7][300] = {
    "C0001|H|Espresso Classics|The traditional espresso favourites.",
    "C0002|H|Espresso Espresions|Delicious blend of espresso, milk, and luscious flavours.",
    "C0003|H|Tea & Cocoa|Gloria Jean's uses only the finest cocoas and single garden teas. Always satisfying.",
    "C0004|C|Iced Chocolate|Gloria Jean's version of a traditional favourite.",
    "C0005|C|Mocha Chillers|An icy blend of chocolate, coffee, milk and delicious mix-ins.",
    "C0006|C|Espresso Chillers|A creamy blend of fresh espresso, chocolate, milk, ice, and flavours.",
    "C0007|C|On Ice|Cool refreshing Gloria Jean's creations over ice.",
  };

  for (j = 0; j < 7; j++) {
    input = inputs[j];
    for (i = 0; i < 4; i++) {
      if (i == 0) {
        token[i] = strtok(input, "|");
      } else {
        token[i] = strtok(NULL, "|");
      }
      printf("%s\n", token[i]);
    }
  }
  return 0;
}

Upvotes: 2

Related Questions