Reputation: 31
I am trying to parse a CSV file in C. I have each line of my file scanned into the array called lines, which works. Then, I check each character in the line to see if it is a comma (44).
I am having trouble with the last else statement, which should start a new token when there is a comma.
The first token of the line is always read correctly, but the rest are not (strange symbols/characters appear in output). I tried removing the '\0' statement, since I'm not sure that I needed it, but I have the same problem. I am guessing this is some kind of undefined behavior, but I am not sure.
Thanks!
//[rows = num strings] [max num chars per string]
int max_len = 21;
int num_strings = 12;
char lines[num_strings][max_len];
//Open file
data = fopen("data.txt", "r");
//Check if file opened correctly
if (data == NULL) {
    printf ("File did not open correctly.\n");
}
//Parse each token
char tokens[60][21];
int counter = 0;
//Read each line
for(int i=0; i<num_strings; i++)
{
    //Scan line into lines[i]
    fscanf(data, "%s", lines[i]);
    printf("\nThis line = %s\n",lines[i]);
    //Read each char in line
    for(int j=0; j<strlen(lines[i]); j++)
    {
        char *c = &lines[i][j];
        //printf("Current char of line: %c\n", c[0]);
        //If it's not a comma (or null character), add to current token
        if(c[0] != 44) {
            tokens[counter][j] = c[0];
        } else {//If it is, terminate string and go to next token
            tokens[counter][j] = '\0';
            printf("This token = %s\n",tokens[counter]);
            counter++;
        }
    }
}
Upvotes: 1
Views: 318
Reputation: 811
Your code has a couple of issues. I'll start by giving you a working version of the program's inner loop:
int tok_i = 0;
int jmax = strlen(lines[i]) + 1;
for(int j = 0; j < jmax; j++)
{
    char *c = &lines[i][j];
    //printf("Current char of line: %c\n", c[0]);
    //If it's not a comma (or null character), add to current token
    if(c[0] != 44 && c[0] != '\0') {
        tokens[counter][tok_i] = c[0];
        tok_i++;
    } else {//If it is, terminate string and go to next token
        tokens[counter][tok_i] = '\0';
        printf("This token = %s\n",tokens[counter]);
        counter++;
        tok_i = 0;
    }
}
The main reason your code didn't work is that you were writing to tokens[counter][j], where j is your current position in the line. This is fine for the first token of a line, where the first character of the token is the first character of the line. For subsequent tokens, though, the first character of the token sits somewhere later in the line, where j is not 0. The leading elements of tokens[counter] are then never written, and printing them shows whatever garbage the uninitialized array happened to contain.
To fix this I've added another counter, tok_i, which tracks where we are within the current token. It is incremented whenever we copy a character (anything other than a comma or null) and reset to 0 whenever we find a comma or null, because the next loop iteration starts a new token.
With this method we have to check explicitly for the \0 character at the end of the string, at which point a second issue becomes apparent: strlen gives the length of the string excluding the \0 character. Since we want to loop over the line including the \0, the ending condition of our for loop has to be j < strlen(lines[i]) + 1.
You'll also notice that there's little point in having strlen inside the loop condition: strlen(lines[i]) does not change over the course of the loop, yet we are asking for it to be evaluated on every iteration, a small waste of time. The compiler will probably optimize this away, but to be sure we evaluate the bound once, before the loop, and store it in the variable jmax.
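Putting the points above together, here is a minimal sketch of the same idea as a standalone function. The name split_line, the MAX_TOKEN_LEN constant, and the start/max_tokens parameters are my own additions for illustration, not part of your code. It also adds bounds checks so an over-long token or too many tokens cannot overrun the array.

```c
#include <stddef.h>

#define MAX_TOKEN_LEN 21  /* assumed: matches tokens[60][21] in the question */

/* Hypothetical helper: splits `line` on commas, writing tokens into
 * tokens[start], tokens[start + 1], ... Returns the number written. */
int split_line(const char *line, char tokens[][MAX_TOKEN_LEN],
               int start, int max_tokens)
{
    int counter = start;  /* index of the token being filled */
    int tok_i = 0;        /* position inside the current token */

    if (counter >= max_tokens)
        return 0;

    for (size_t j = 0; ; j++) {
        char c = line[j];
        if (c != ',' && c != '\0') {
            /* Copy the character; over-long tokens are truncated. */
            if (tok_i < MAX_TOKEN_LEN - 1)
                tokens[counter][tok_i++] = c;
        } else {
            /* Comma or end of line: terminate and start a new token. */
            tokens[counter][tok_i] = '\0';
            counter++;
            tok_i = 0;
            if (c == '\0' || counter >= max_tokens)
                break;
        }
    }
    return counter - start;
}
```

In your outer loop you would call it as counter += split_line(lines[i], tokens, counter, 60);. Because the loop stops on the terminating \0 itself, no strlen call is needed at all.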
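Other issues include that fscanf(data, "%s", lines[i]); will only work if the line you're reading has no spaces in it, because %s stops at the first whitespace character. It also puts no limit on how many characters are written into lines[i]. It's usual to use fgets for this kind of scenario: it takes the whole line including spaces, and it takes the buffer size as an argument. Note that fgets keeps the trailing newline, which you will probably want to strip.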
Also, hardcoding the number of lines in the input file is unnecessary, though it could be acceptable if the input has a fixed, known length.
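As a rough sketch of both suggestions, the function below reads lines with fgets until the file ends or the array is full, rather than assuming exactly 12 lines. The name read_lines and the MAX_LEN constant are hypothetical, chosen to mirror the sizes in the question.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LEN 21  /* assumed: same buffer size as max_len in the question */

/* Hypothetical helper: reads up to max_lines lines from fp into lines.
 * Returns the number of lines actually read. */
int read_lines(FILE *fp, char lines[][MAX_LEN], int max_lines)
{
    int n = 0;
    while (n < max_lines && fgets(lines[n], MAX_LEN, fp) != NULL) {
        /* fgets keeps the line ending; cut the string at the first
         * '\r' or '\n' (strcspn returns the length if neither occurs). */
        lines[n][strcspn(lines[n], "\r\n")] = '\0';
        n++;
    }
    return n;
}
```

One caveat: a line longer than MAX_LEN - 1 characters is not rejected, it is simply split across several entries, so the buffer size should be chosen with the real data in mind.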
Upvotes: 1
Reputation: 16512
My suggestion is to draw a diagram of your strings. Say you have this line and you have just found the first comma:
.               1         2
.     01234567890123456789012
i -> |aaaa,bbb,cccccc,dddd,e\0
.         ^
          j
This is the tokens array:
         01234
counter |aaaa\0
Now you increment counter, but j will continue, so next time you will have:
.               1         2
.     01234567890123456789012
i -> |aaaa,bbb,cccccc,dddd,e\0
.             ^
              j
and the next line in the tokens array will be:
         01234 567
        |aaaa\0
counter |????? bbb\0
Not exactly what you intended, right?
You should find another way to copy the characters into the tokens array.
May I suggest that, if you just need to fill the tokens array, you can get rid of lines entirely and read the file one character at a time?
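A minimal sketch of that character-at-a-time approach could look like the following. The function name read_tokens and the MAX_TOKEN_LEN constant are made up for illustration, and it assumes a plain file where commas and newlines are the only separators.

```c
#include <stdio.h>

#define MAX_TOKEN_LEN 21  /* assumed: matches tokens[60][21] in the question */

/* Hypothetical helper: fills tokens straight from the file, treating
 * ',' and '\n' as separators. Returns the number of tokens stored. */
int read_tokens(FILE *fp, char tokens[][MAX_TOKEN_LEN], int max_tokens)
{
    int counter = 0;  /* index of the token being filled */
    int tok_i = 0;    /* position inside the current token */
    int c;            /* int rather than char, so EOF can be detected */

    while (counter < max_tokens && (c = fgetc(fp)) != EOF) {
        if (c == '\r')
            continue;  /* ignore carriage returns from CRLF files */
        if (c == ',' || c == '\n') {
            tokens[counter][tok_i] = '\0';
            counter++;
            tok_i = 0;
        } else if (tok_i < MAX_TOKEN_LEN - 1) {
            tokens[counter][tok_i++] = (char)c;  /* truncates long tokens */
        }
    }
    /* Store a final token if the file does not end with a newline. */
    if (tok_i > 0 && counter < max_tokens) {
        tokens[counter][tok_i] = '\0';
        counter++;
    }
    return counter;
}
```

Since each token has its own index tok_i, the problem shown in the diagram cannot occur here. The trade-off is that the line boundaries are lost, so this only suits the case where the flat token list is all you need.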
Also, I suppose this is just for practice, since you did not mention the fact that a CSV field may contain a comma inside a quoted string:
aaaa,"bb,bb",ccc
has three fields.
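If you ever need to handle that case, one possible sketch is to track whether you are currently inside double quotes and treat a comma as a separator only when you are not. The function split_quoted and its parameters are hypothetical, and this deliberately ignores the rest of the usual CSV rules, such as a doubled quote ("") standing for a literal quote inside a quoted field.

```c
#include <stddef.h>

#define MAX_TOKEN_LEN 21  /* assumed: matches tokens[60][21] in the question */

/* Hypothetical helper: splits `line` on commas that are outside
 * double quotes. Returns the number of tokens written. */
int split_quoted(const char *line, char tokens[][MAX_TOKEN_LEN],
                 int max_tokens)
{
    int counter = 0;    /* index of the token being filled */
    int tok_i = 0;      /* position inside the current token */
    int in_quotes = 0;  /* nonzero while between a pair of quotes */

    if (max_tokens <= 0)
        return 0;

    for (size_t j = 0; ; j++) {
        char c = line[j];
        if (c == '"') {
            in_quotes = !in_quotes;  /* toggle; the quote is not copied */
        } else if (c == '\0' || (c == ',' && !in_quotes)) {
            tokens[counter][tok_i] = '\0';
            counter++;
            tok_i = 0;
            if (c == '\0' || counter >= max_tokens)
                break;
        } else if (tok_i < MAX_TOKEN_LEN - 1) {
            tokens[counter][tok_i++] = c;  /* includes quoted commas */
        }
    }
    return counter;
}
```

For the example line above, this stores the three tokens aaaa, bb,bb and ccc. For real-world CSV files a dedicated parsing library is usually the safer choice.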
Upvotes: 1