J_code

Reputation: 347

Parsing text file (unresolved issues)

This is a problem that I thought was resolved, but evidently I'm still having small bugs here and there. The code below is what I'm using to parse a text file written in a made-up language that I'm developing for a prototype microcontroller. Basically, anytime I reach a semicolon, I treat any text afterwards as a comment and ignore it:

    //Get characters from .j FILE
    while (fgets(line, 1000, IN) != NULL)
    {
        //Get each line of .j file


        //Compute length of each line
        len = strlen(line);

        //If the line is not empty and ends with a newline escape sequence
        if (len > 0 && line[len-1] == '\n')
        {
            //Replace with null
            line[len-1] = '\0';
        }

        //Search for semicolons in .J FILE
        semi_token = strpbrk(line, ";\r\n\t");

        //Replace with null terminator
        if (semi_token) 
        {
            *semi_token = '\0';
        }
        printf("line is %s\n",line );

        //Copy each line
        assign = line;

        // printf("line is %s\n",line );

        // len = strlen(line);

        // printf("line length is %d\n",len );

        // parse_tok = strtok(line, "\r ");

    }   

The code above is the while loop that gets each line from the text file. If I have a file in the format below, everything works fine:

;;
;; Basic
;;

defun test arg3 arg2 arg1 min return 
;defun love arg2 arg1 * return
;defun func_1 6 6 eq return
;defun func_2 20 100 / return

defun main
0 -200 55 test printnum endl
;1 2 3 test printnum endl
;38 23 8 test printnum endl
;5 6 7 love printnum endl
;love printnum endl
;func_1 printnum endl
;func_2 printnum endl
return

Observe output:

line is 
line is 
line is 
line is 
line is defun test arg3 arg2 arg1 min return 
line is 
line is 
line is 
line is 
line is defun main
line is 0 -200 55 test printnum endl
line is 
line is 
line is 
line is 
line is 
line is 
line is return

The problem arises when my text file contains tabs, as in this case with nested statements:

;;
;; program to test nested ifs
;;

defun testIfs ;; called with one parameter n

arg1           ; get n to the top of the stack

dup 16 gt
if   ; 16 > n

    dup 8 gt
    if  ; 8 > n

        dup 4 gt
    if  ; 4 > n
        0
    else        ; 4 <= n
        1
    endif

    else        ; 8 <= n
       2
    endif

else        ; 16 <= n

     dup 24 gt
     if ; 24 > n

        dup 20 gt
    if  ; 20 > n
           3
        else        ; 20 <= n
           4
    endif

     else           ; 24 <= n

        dup 32 gt
        if  ; 32 > n
           5
    else
        -10
        endif

     endif

endif

return


defun main 
5 testIfs printnum endl
11 testIfs printnum endl
28 testIfs printnum endl
35 testIfs printnum endl
return

Observe the output:

line is 
line is 
line is 
line is 
line is defun testIfs 
line is 
line is arg1           
line is 
line is dup 16 gt
line is if   
line is 
line is     dup 8 gt
line is     if
line is 
line is     
line is 
line is 
line is 
line is 
line is 
line is 
line is     else
line is        2
line is     endif
line is 
line is else
line is 
line is      dup 24 gt
line is      if
line is 
line is      
line is 
line is            3
line is         else   
line is            4
line is 
line is 
line is      else   
line is 
line is         dup 32 gt
line is         if
line is            5
line is 
line is 
line is         endif
line is 
line is      endif
line is 
line is endif
line is 
line is return
line is 
line is 
line is defun main 
line is 5 testIfs printnum endl
line is 11 testIfs printnum endl
line is 28 testIfs printnum endl
line is 35 testIfs printnum endl
line is return

As you can see, it skips (seemingly randomly) certain lines that are tabbed and I don't know why it is doing this. What needs to be modified in my code so that it will not randomly skip certain lines that are tabbed? Any help is appreciated!

Upvotes: 0

Views: 61

Answers (2)

David C. Rankin

Reputation: 84569

As others have pointed out, your call strpbrk (line, ";\r\n\t"); returns a pointer to the first ';', '\r', '\n' or '\t' in line. If your file uses tab characters for indentation (which it shouldn't unless it is a Makefile), you potentially nul-terminate your line at the very beginning. This isn't what you want.

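However, your choice of strpbrk is a good one for the task. If you remove the '\t' from your accept set of characters, you will be closer to achieving what you intend. (You can usually remove the '\r' as well: on platforms that use "\r\n" line endings, a file opened in text mode has them converted to '\n' on read. If a file with "\r\n" endings is read on a platform that does not convert them, keep the '\r' in the set.)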
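To make the difference between the two accept sets concrete, here is a minimal sketch. The helper name cut_at is hypothetical (it is not part of the question's code); it only wraps the strpbrk call so both sets can be tried on the same tab-indented line.

```c
#include <string.h>

/* Hypothetical helper: cut `line` at the first character that appears in
 * `accept` and return the length of what is left. */
static size_t cut_at(char *line, const char *accept)
{
    char *p = strpbrk(line, accept);   /* first match, or NULL if none */

    if (p)
        *p = '\0';                     /* nul-terminate at the match */

    return strlen(line);
}
```

With the line "\tdup 4 gt ; comment\n", cut_at(line, ";\r\n\t") stops at the leading tab and returns 0, which is the "skipped" line from the question, while cut_at(line, ";\n") keeps "\tdup 4 gt " and returns 10.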

In a very simple version of your code where you do not worry about trimming any of the trailing whitespace between the last non-whitespace character and the beginning of the comment (or end of line), you can do something as simple as nul-terminating the line at the pointer returned by strpbrk, e.g.

#include <stdio.h>
#include <string.h>

#define MAXC 1024

int main (void) {

    char line[MAXC] = "";
    size_t lineno = 0;

    /* read each line from stdin (e.g. redirect file, ./prog <file) */
    while (fgets(line, MAXC, stdin) != NULL)
    {
        char *p = NULL;         /* pointer for strpbrk return */

        /* Search for semicolons in line or newline */
        if ((p = strpbrk (line, ";\n")))
            *p = 0;             /* nul-terminate at ';' or '\n' */

        /* output line (single-quotes simply show trim of whitespace) */
        printf ("%3zu: '%s'\n", ++lineno, line);
    }

    return 0;
}

Example Use/Output

note: single-quotes have been included around the output to demonstrate the trailing whitespace left.

$ ./bin/parsesemisimple <dat/semicmtfile.txt
  1: ''
  2: ''
  3: ''
  4: ''
  5: 'defun testIfs '
  6: ''
  7: 'arg1           '
  ...

Notice how the line "arg1 ; get n to the top of the stack" keeps the 10 spaces between the end of arg1 and the comment character. It's never a good idea to leave dangling whitespace.

To remove the trailing whitespace, you can include ctype.h and use its isspace function to test whether the characters that precede the comment are whitespace, and if so, simply keep backing up until you find the last non-whitespace character. Once you find the last non-whitespace character, you then terminate after it.

You can add a few lines of code to your strpbrk conditional to do just that. Note: when backing up, you always want to make sure (p > line) so you don't back up past the start of line; you also know that if p isn't greater than line, the comment begins there or it was a blank line. You could do something like the following:

#include <ctype.h>
...
        /* Search for semicolons in line or newline */
        if ((p = strpbrk (line, ";\n"))) {
            if (p > line) {         /* test characters in line */
                /* remove all trailing whitespace */
                while (p > line && isspace ((unsigned char)*--p)) {}
                *++p = 0;   /* nul-terminate after last non-whitespace char */
            }               /* before ';' or end of line */
            else
                *p = 0;     /* otherwise nul-terminate at ';' */
        }

(If you are not familiar with C Operator Precedence, now would be a good opportunity to make friends with it. Pay attention to the column describing whether the associativity is right to left or left to right; it makes a difference.)
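If you would rather keep the trimming in one place, the same idea can be packaged as a function. The following is only a sketch under a few assumptions: the name strip_comment is hypothetical, it uses strcspn instead of strpbrk so that a last line without a '\n' is handled too, and it looks at p[-1] rather than using *--p. One consequence of that last choice is that a line holding only whitespace before the ';' ends up empty, whereas the snippet above leaves a single whitespace character in that case.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: remove a ';' comment, the line ending and any
 * trailing whitespace from `line`, in place. Returns `line`. */
static char *strip_comment(char *line)
{
    /* end of the useful text: first ';', '\r' or '\n', else end of string */
    char *p = line + strcspn(line, ";\r\n");

    /* back up over trailing whitespace; the cast keeps isspace well-defined
     * when plain char is signed and the value is negative */
    while (p > line && isspace((unsigned char)p[-1]))
        p--;

    *p = '\0';      /* terminate right after the last non-whitespace char */
    return line;
}
```

Called on each buffer filled by fgets, strip_comment turns "arg1           ; get n to the top of the stack\n" into "arg1" and leaves leading indentation (spaces or tabs) untouched.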

Example Use/Output

Now you can check the full output and confirm that the comments and all trailing whitespace have been removed. (you can remove the single-quotes when you are satisfied all is working as it should)

$ ./bin/parsesemicmt <dat/semicmtfile.txt
  1: ''
  2: ''
  3: ''
  4: ''
  5: 'defun testIfs'
  6: ''
  7: 'arg1'
  8: ''
  9: 'dup 16 gt'
 10: 'if'
 11: ''
 12: '    dup 8 gt'
 13: '    if'
 14: ''
 15: '        dup 4 gt'
 16: '    if'
 17: '        0'
 18: '    else'
 19: '        1'
 20: '    endif'
 21: ''
 22: '    else'
 23: '       2'
 24: '    endif'
 25: ''
 26: 'else'
 27: ''
 28: '     dup 24 gt'
 29: '     if'
 30: ''
 31: '        dup 20 gt'
 32: '    if'
 33: '           3'
 34: '        else'
 35: '           4'
 36: '    endif'
 37: ''
 38: '     else'
 39: ''
 40: '        dup 32 gt'
 41: '        if'
 42: '           5'
 43: '    else'
 44: '        -10'
 45: '        endif'
 46: ''
 47: '     endif'
 48: ''
 49: 'endif'
 50: ''
 51: 'return'
 52: ''
 53: ''
 54: 'defun main'
 55: '5 testIfs printnum endl'
 56: '11 testIfs printnum endl'
 57: '28 testIfs printnum endl'
 58: '35 testIfs printnum endl'
 59: 'return'

Note: as indicated in the code you have commented out, if you intend to call strtok, there isn't a need to remove the trailing whitespace. If you include a space as one of the delimiters when tokenizing line, all sequential occurrences will be considered a single delimiter and removed there.
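As a rough illustration of that strtok behavior, here is a small sketch. The helper name split_tokens and the delimiter set " \t\r\n" are assumptions made for the example (your commented-out call used "\r "); it expects the comment to have been cut off already.

```c
#include <string.h>

/* Hypothetical helper: split `line` in place on spaces, tabs and line
 * endings, storing up to `max` token pointers in `tokens`.
 * Returns the number of tokens stored. */
static size_t split_tokens(char *line, char *tokens[], size_t max)
{
    size_t n = 0;

    /* strtok skips runs of delimiters, so leading, repeated and trailing
     * whitespace never produce empty tokens */
    for (char *tok = strtok(line, " \t\r\n");
         tok != NULL && n < max;
         tok = strtok(NULL, " \t\r\n"))
        tokens[n++] = tok;

    return n;
}
```

For a buffer holding "    dup   8 gt   \t \n" this stores the three tokens "dup", "8" and "gt", and a line made up only of whitespace yields zero tokens. Keep in mind that strtok modifies the buffer and keeps internal state between calls, so it is not suitable for nested tokenizing.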

Look things over and let me know if you have any questions. If I misinterpreted your question, let me know and I'm happy to test further.

Upvotes: 1

anatolyg

Reputation: 28269

Here is the part that looks for semicolons:

    //Search for semicolons in .J FILE
    semi_token = strpbrk(line, ";\r\n\t");

It explicitly treats tab characters the same as semicolons, i.e. starting a comment. As for why the bug doesn't always happen - I guess sometimes your editor converts a tab (\t) character in your *.J input file into spaces.
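One way to check that guess is to print each line with its whitespace made visible before you cut anything, so you can see which lines really start with a tab. This is only a debugging sketch; the helper name show_whitespace is hypothetical.

```c
#include <stddef.h>

/* Hypothetical debugging helper: copy `in` to `out`, spelling tabs,
 * carriage returns and newlines as the visible sequences \t, \r and \n.
 * `out` is always nul-terminated (outsz must be at least 1); input that
 * does not fit is silently truncated. */
static void show_whitespace(const char *in, char *out, size_t outsz)
{
    size_t o = 0;

    /* each input char needs at most 2 output chars, plus room for '\0' */
    for (; *in != '\0' && o + 2 < outsz; in++) {
        switch (*in) {
        case '\t': out[o++] = '\\'; out[o++] = 't'; break;
        case '\r': out[o++] = '\\'; out[o++] = 'r'; break;
        case '\n': out[o++] = '\\'; out[o++] = 'n'; break;
        default:   out[o++] = *in;                  break;
        }
    }
    out[o] = '\0';
}
```

Calling it on the buffer right after fgets and printing the result with printf("%s\n", out) shows a tab-indented line as, for example, \tif\t; 4 > n\n, while a line indented with spaces prints with its spaces unchanged.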

Upvotes: 4
