Reputation: 347
This is a problem that I thought was resolved, but evidently I'm still hitting small bugs here and there. The code below is what I'm using to parse a text file written in a made-up language that I'm developing for a prototype microcontroller. Basically, any time I reach a semicolon, I treat any text after it as a comment and ignore it:
//Get characters from .j FILE
while (fgets(line, 1000, IN) != NULL)
{
    //Get each line of .j file
    //Compute length of each line
    len = strlen(line);
    //If the line is non-empty and ends with a newline escape sequence
    if (len > 0 && line[len-1] == '\n')
    {
        //Replace with null
        line[len-1] = '\0';
    }
    //Search for semicolons in .J FILE
    semi_token = strpbrk(line, ";\r\n\t");
    //Replace with null terminator
    if (semi_token)
    {
        *semi_token = '\0';
    }
    printf("line is %s\n",line );
    //Copy each line
    assign = line;
    // printf("line is %s\n",line );
    // len = strlen(line);
    // printf("line length is %d\n",len );
    // parse_tok = strtok(line, "\r ");
}
The code above is the while loop that gets each line from the text file. If I have a file in the format below, everything works fine:
;;
;; Basic
;;
defun test arg3 arg2 arg1 min return
;defun love arg2 arg1 * return
;defun func_1 6 6 eq return
;defun func_2 20 100 / return
defun main
0 -200 55 test printnum endl
;1 2 3 test printnum endl
;38 23 8 test printnum endl
;5 6 7 love printnum endl
;love printnum endl
;func_1 printnum endl
;func_2 printnum endl
return
Observe output:
line is
line is
line is
line is
line is defun test arg3 arg2 arg1 min return
line is
line is
line is
line is
line is defun main
line is 0 -200 55 test printnum endl
line is
line is
line is
line is
line is
line is
line is return
The problem arises when my text file contains tabs, as it does when there are nested statements:
;;
;; program to test nested ifs
;;
defun testIfs ;; called with one parameter n
arg1 ; get n to the top of the stack
dup 16 gt
if ; 16 > n
dup 8 gt
if ; 8 > n
dup 4 gt
if ; 4 > n
0
else ; 4 <= n
1
endif
else ; 8 <= n
2
endif
else ; 16 <= n
dup 24 gt
if ; 24 > n
dup 20 gt
if ; 20 > n
3
else ; 20 <= n
4
endif
else ; 24 <= n
dup 32 gt
if ; 32 > n
5
else
-10
endif
endif
endif
return
defun main
5 testIfs printnum endl
11 testIfs printnum endl
28 testIfs printnum endl
35 testIfs printnum endl
return
Observe the output:
line is
line is
line is
line is
line is defun testIfs
line is
line is arg1
line is
line is dup 16 gt
line is if
line is
line is dup 8 gt
line is if
line is
line is
line is
line is
line is
line is
line is
line is
line is else
line is 2
line is endif
line is
line is else
line is
line is dup 24 gt
line is if
line is
line is
line is
line is 3
line is else
line is 4
line is
line is
line is else
line is
line is dup 32 gt
line is if
line is 5
line is
line is
line is endif
line is
line is endif
line is
line is endif
line is
line is return
line is
line is
line is defun main
line is 5 testIfs printnum endl
line is 11 testIfs printnum endl
line is 28 testIfs printnum endl
line is 35 testIfs printnum endl
line is return
As you can see, it skips (seemingly randomly) certain lines that are tabbed and I don't know why it is doing this. What needs to be modified in my code so that it will not randomly skip certain lines that are tabbed? Any help is appreciated!
Upvotes: 0
Views: 61
Reputation: 84569
As others have pointed out, your call `strpbrk (line, ";\r\n\t")` returns a pointer to the first `';'`, `'\r'`, `'\n'` or `'\t'` in `line`. If your file uses tab characters for indentation (which it shouldn't unless it is a Makefile), you potentially nul-terminate your line at the very beginning. This isn't what you want.
However, your choice of `strpbrk` is a good one for the task. If you remove the `'\t'` from your accept set of characters, you will be closer to achieving what you intend. (You can remove the `'\r'` as well if the file is read in text mode on a platform that translates line endings to `'\n'` on read; keep it if the file may contain CRLF line endings that reach your program untranslated.)
In a very simple version of your code, where you do not worry about trimming the trailing whitespace between the last non-whitespace character and the beginning of the comment (or end of line), you can simply nul-terminate the line at the pointer returned by `strpbrk`, e.g.
#include <stdio.h>
#include <string.h>

#define MAXC 1024

int main (void) {

    char line[MAXC] = "";
    size_t lineno = 0;

    /* read each line from stdin (e.g. redirect file, ./prog <file) */
    while (fgets(line, MAXC, stdin) != NULL)
    {
        char *p = NULL;     /* pointer for strpbrk return */
        /* Search for semicolons in line or newline */
        if ((p = strpbrk (line, ";\n")))
            *p = 0;         /* nul-terminate at ';' or '\n' */
        /* output line (single-quotes simply show trim of whitespace) */
        printf ("%3zu: '%s'\n", ++lineno, line);
    }

    return 0;
}
Example Use/Output
note: single-quotes have been included around the output to demonstrate the trailing whitespace left.
$ ./bin/parsesemisimple <dat/semicmtfile.txt
1: ''
2: ''
3: ''
4: ''
5: 'defun testIfs '
6: ''
7: 'arg1 '
...
Notice how the line "arg1 ; get n to the top of the stack" has 10 spaces between the end of `arg1` and the comment character. It's never a good idea to leave dangling whitespace.
To remove the trailing whitespace, you can include `ctype.h` and use its `isspace` function to test whether the characters that precede the comment are whitespace, and if so, simply keep backing up until you find the last non-whitespace character. Once you find the last non-whitespace character, you terminate the string right after it.
You can add a few lines of code to your `strpbrk` conditional to do just that. Note: when backing up, you always want to make sure `(p > line)` so you don't back up past the start of `line`. You also know that if `p` isn't greater than `line`, the comment begins there or it was a blank line. You could do something like the following:
#include <ctype.h>
...
        /* Search for semicolons in line or newline */
        if ((p = strpbrk (line, ";\n"))) {
            if (p > line) {     /* test characters in line */
                /* back up over all trailing whitespace */
                while (p > line && isspace ((unsigned char)*--p)) {}
                /* if we stopped on a non-whitespace char, step past it;
                 * otherwise everything before ';' was whitespace */
                if (!isspace ((unsigned char)*p))
                    ++p;
                *p = 0;         /* nul-terminate after last non-whitespace char */
            }                   /*   before ';' or end of line */
            else
                *p = 0;         /* otherwise nul-terminate at ';' */
        }
(If you are not familiar with C Operator Precedence, now would be a good opportunity to make friends with it. Pay attention to the column describing whether the associativity is right to left or left to right; it makes a difference.)
Example Use/Output
Now you can check the full output and confirm that the comments and all trailing whitespace have been removed. (you can remove the single-quotes when you are satisfied all is working as it should)
$ ./bin/parsesemicmt <dat/semicmtfile.txt
1: ''
2: ''
3: ''
4: ''
5: 'defun testIfs'
6: ''
7: 'arg1'
8: ''
9: 'dup 16 gt'
10: 'if'
11: ''
12: ' dup 8 gt'
13: ' if'
14: ''
15: ' dup 4 gt'
16: ' if'
17: ' 0'
18: ' else'
19: ' 1'
20: ' endif'
21: ''
22: ' else'
23: ' 2'
24: ' endif'
25: ''
26: 'else'
27: ''
28: ' dup 24 gt'
29: ' if'
30: ''
31: ' dup 20 gt'
32: ' if'
33: ' 3'
34: ' else'
35: ' 4'
36: ' endif'
37: ''
38: ' else'
39: ''
40: ' dup 32 gt'
41: ' if'
42: ' 5'
43: ' else'
44: ' -10'
45: ' endif'
46: ''
47: ' endif'
48: ''
49: 'endif'
50: ''
51: 'return'
52: ''
53: ''
54: 'defun main'
55: '5 testIfs printnum endl'
56: '11 testIfs printnum endl'
57: '28 testIfs printnum endl'
58: '35 testIfs printnum endl'
59: 'return'
Note: as indicated in the code you have commented out, if you intend to call `strtok`, there isn't a need to remove the trailing whitespace. If you include a space as one of the delimiters when tokenizing `line`, all sequential occurrences will be treated as a single delimiter and removed there.
Look things over and let me know if you have any questions. If I misinterpreted your question, let me know and I'm happy to test further.
Upvotes: 1
Reputation: 28269
Here is the part that looks for semicolons:
//Search for semicolons in .J FILE
semi_token = strpbrk(line, ";\r\n\t");
It explicitly treats tab characters the same as semicolons, i.e. as starting a comment. As for why the bug doesn't always happen, I guess sometimes your editor converts a tab (`\t`) character in your `*.J` input file into spaces.
Upvotes: 4