benimen16

Reputation: 186

Regular expression for comments that begin and end on different lines

Okay, so I've been working on this problem for a couple of weeks now. I have a program that reads a file containing code in a mini language and prints each token along with a description of what the token is. Part of this mini language is its support for single-line and multi-line comments.

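The regular expression for comments is \{[^\}]*\}, meaning: an opening curly brace, then any number of characters other than a closing curly brace, then a closing curly brace.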

Side Note: Comments cannot be nested. A comment such as {This is a {nested} comment} would not be considered valid, because a comment can only have one closing curly brace. That being said, a comment such as {This is another {comment} would be valid, since there is only one closing curly brace.
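To make the rule concrete, here is a minimal sketch that checks whole strings against the comment pattern. The class name CommentRuleDemo and the helper isValidComment are made up for illustration and are not part of the program below; the check uses matches(), so the entire string has to be exactly one comment.

```java
import java.util.regex.Pattern;

class CommentRuleDemo {
    // The comment pattern from the question: '{', any characters except '}', then '}'.
    static final Pattern COMMENT = Pattern.compile("\\{[^\\}]*\\}");

    // Hypothetical helper: true only when the entire string is exactly one comment.
    static boolean isValidComment(String text) {
        return COMMENT.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidComment("{This is a comment}"));          // true
        System.out.println(isValidComment("{This is another {comment}"));   // true
        System.out.println(isValidComment("{This is a {nested} comment}")); // false
    }
}
```

The second string passes because an extra opening brace is just another character that is not a closing brace; the third fails because the first closing brace ends the comment before the string ends.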

While testing this program, I ran into an issue: when it reads a file and comes across a multi-line comment, it doesn't recognize the comment as multi-line and prints the tokens inside the comment rather than the whole comment itself. I've spent a good week or week and a half trying to get this to work. I've tried various combinations of regular expressions and different placements of my if statements, but have come to no solution. I've tried everything I can to fix it, but since I'm not very experienced with regular expressions I must be missing something pretty obvious.

Here is a snippet of my code.
Side Note: I have my program take in the name of the file through user input in another class.

import java.io.*;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;    

public class Analyzer {

    public void lex(String filename) {

        try {

            Scanner scanFile = new Scanner(new File(filename));

            while(scanFile.hasNextLine()) {

                String str = scanFile.nextLine();

                String keyword = "(\\bWHILE\\b|\\bENDWHILE\\b|\\bIF\\b|\\bENDIF\\b|\\bPRINT\\b)";
                String comment = "(\\{[^\\}]*\\})";
                String literal = "(\\b[0-9]+\\b)";
                String identifier = "(\\b[a-z]+\\b)";
                String symbol = "((\\()|(\\))|(;))";
                String operator = "((\\+)|(\\-)|(\\*)|(/)|(\\=)|(\\<)|(\\:\\=))";

                String keywordERROR = "(PRINT\\w+)";
                String commentERROR = "(\\{.*\\}.*\\})";
                String literalERROR = "([0-9]+[a-zA-Z_]+)";
                String identERROR = "([a-z]+[A-Z_0-9]+)";
                String alphabetERROR = "(~|`|\\!|@|#|\\$|%|\\^|\\&|_|\\||\\:|'|\"|\\?|\\>|\\.|\\,|\\\\)";

                String regex = keyword + "|" + keywordERROR + "|" + comment + "|" + commentERROR + "|" + literal + "|" + literalERROR
                    + "|" + identifier + "|" + identERROR + "|" + symbol + "|" + operator + "|" + alphabetERROR;

                Pattern pattern = Pattern.compile(regex);
                Matcher matcher = pattern.matcher(str);

                while(matcher.find()) {

                    if(matcher.group(1) != null)
                        System.out.println(matcher.group(1) + "\tKeyword");
                    else if(matcher.group(2) != null)
                        System.out.println(matcher.group(2) + "\tError");

                    if(matcher.group(3) != null)
                        System.out.println(matcher.group(3) + "\tComment");
                    else if(matcher.group(4) != null)
                        System.out.println(matcher.group(4) + "\tError");

                    if(matcher.group(5) != null)
                        System.out.println(matcher.group(5) + "\tLiteral");
                    else if(matcher.group(6) != null)
                        System.out.println(matcher.group(6) + "\tError");

                    if(matcher.group(7) != null)
                        System.out.println(matcher.group(7) + "\tIdentifier");
                    else if(matcher.group(8) != null)
                        System.out.println(matcher.group(8) + "\tError");

                    if(matcher.group(9) != null) {
                        if(matcher.group(10) != null)
                            System.out.println(matcher.group(10) + "\tOpen Parenthesis");
                        if(matcher.group(11) != null)
                            System.out.println(matcher.group(11) + "\tClose Parenthesis");
                        if(matcher.group(12) != null)
                            System.out.println(matcher.group(12) + "\tSemi-colon");
                    }

                    if(matcher.group(13) != null) {
                        if(matcher.group(14) != null)
                            System.out.println(matcher.group(14) + "\tAddition Operator");
                        if(matcher.group(15) != null)
                            System.out.println(matcher.group(15) + "\tSubtraction Operator");
                        if(matcher.group(16) != null)
                            System.out.println(matcher.group(16) + "\tMultiplication Operator");
                        if(matcher.group(17) != null)
                            System.out.println(matcher.group(17) + "\tDivision Operator");
                        if(matcher.group(18) != null)
                            System.out.println(matcher.group(18) + "\tEquality Comparison Operator");
                        if(matcher.group(19) != null)
                            System.out.println(matcher.group(19) + "\tLess Than Operator");
                        if(matcher.group(20) != null)
                            System.out.println(matcher.group(20) + "\tAssignment Operator");
                    }

                    if(matcher.group(21) != null) 
                        System.out.println(matcher.group(21) + "\tError");
                }
            }

            scanFile.close();

        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}

Like I said before, I've tried many different ways of solving this issue. Some of the things I've tried were adding line-break sequences like this: \{[^\}]*[\r\n]*\}, \{[\r\n]*[^\}]*\}, \{[\r\n]*[^\}]*[\r\n]*\}, \{[^\}]*\s*\}, \{\s*[^\}]*\s*\}, (?s)\{[^\}]*\} and (?m)\{[^\}]*\}, trying the DOTALL and MULTILINE flags on my Pattern object, and looking for any tutorial I could find, but I haven't had any luck.
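For reference, these attempts can be reproduced in isolation with a sketch like the one below. It assumes the first lines of the sample file are inlined as a string, and the class and method names (FlagExperiment, countMatches, countPerLine) are made up for illustration. It applies the comment pattern once per line, the way a nextLine() loop does, and once to the whole text, under no flags, DOTALL, and MULTILINE.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagExperiment {
    static final String COMMENT = "\\{[^\\}]*\\}";
    // Assumption: the start of the sample file, inlined so no file I/O is needed.
    static final String SOURCE = "{This is\na multi-line\ncomment.}\nWHILE(x<10)\n";

    // Counts how many times the pattern is found in the given input.
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Applies the pattern to each line separately, as a nextLine() loop does.
    static int countPerLine(Pattern pattern, String source) {
        int total = 0;
        for (String line : source.split("\n")) {
            total += countMatches(pattern, line);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] flagSets = {0, Pattern.DOTALL, Pattern.MULTILINE};
        for (int flags : flagSets) {
            Pattern p = Pattern.compile(COMMENT, flags);
            System.out.println("flags=" + flags
                    + " perLine=" + countPerLine(p, SOURCE)
                    + " whole=" + countMatches(p, SOURCE));
        }
    }
}
```

In this sketch the per-line count is 0 and the whole-text count is 1 under every flag setting, which suggests the flags are not the deciding factor: no single line contains both the opening and the closing brace.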

The file that I'm reading from looks like this:

{This is
a multi-line
comment.}
WHILE(x<10)
    PRINT x;
    x:=x+2;
ENDWHILE

The output should look like this:

{This is a multi-line comment}    Comment
WHILE    Keyword
(    Open Parenthesis
x    Identifier
<    Less Than Operator
10   Literal
)    Close Parenthesis
PRINT    Keyword
x    Identifier
;    Semi-colon
x    Identifier
:=   Assignment Operator
x    Identifier
+    Addition Operator
2    Literal
;    Semi-colon
ENDWHILE    Keyword

But instead the output looks like this:

is  Identifier
a   Identifier
multi   Identifier
-   Subtraction Operator
line    Identifier
comment Identifier
.   Error
WHILE   Keyword
(   Open Parenthesis
x   Identifier
<   Less Than Operator
10  Literal
)   Close Parenthesis
PRINT   Keyword
x   Identifier
;   Semi-colon
x   Identifier
:=  Assignment Operator
x   Identifier
+   Addition Operator
2   Literal
;   Semi-colon
ENDWHILE    Keyword

I'm not sure what I'm doing wrong. Any and all help is greatly appreciated!

Upvotes: 0

Views: 55

Answers (2)

DevilsHnd - 退した

Reputation: 9202

You can just continue reading the file with another while loop if your line starts with an open curly brace but doesn't end with a close curly brace, something like this:

while(scanFile.hasNextLine()) {
    String str = scanFile.nextLine().trim();  // trim off indents etc.

    // If the line is blank just read in the next line.
    if (str.equals("")) { continue; }

    // If this is a multi-line comment, keep reading until the closing brace.
    if (str.startsWith("{") && !str.endsWith("}")) { 
        while(scanFile.hasNextLine()) {
            String commentStr = scanFile.nextLine().trim();
            str+= " " + commentStr;
            if (commentStr.endsWith("}")) { break; }
        }
    }

    // Do the rest of your processing....
    // ..................................
    // ..................................
}

On another note....I wouldn't use RegEx to parse this file content but perhaps you need to for some reason. Good RegEx exercise in any case. :)
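The loop above can also be tried on its own with a self-contained sketch like this one. The class name CommentJoiner and the method logicalLines are hypothetical, the input is passed in as a string rather than read from a file, and it assumes (as in the sample file) that a comment starts at the beginning of a line and that its closing brace ends a line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class CommentJoiner {
    // Returns the lines of the source with each multi-line comment joined
    // onto a single line. Blank lines are skipped.
    static List<String> logicalLines(String source) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine().trim();
                if (line.isEmpty()) {
                    continue;
                }
                // A comment that opens here but does not close on this line:
                // keep appending lines until one ends with the closing brace.
                if (line.startsWith("{") && !line.endsWith("}")) {
                    StringBuilder comment = new StringBuilder(line);
                    while (scanner.hasNextLine()) {
                        String next = scanner.nextLine().trim();
                        comment.append(' ').append(next);
                        if (next.endsWith("}")) {
                            break;
                        }
                    }
                    line = comment.toString();
                }
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String source = "{This is\na multi-line\ncomment.}\nWHILE(x<10)\n    PRINT x;\n    x:=x+2;\nENDWHILE\n";
        // First line printed: {This is a multi-line comment.}
        logicalLines(source).forEach(System.out::println);
    }
}
```

Each returned line can then go through the existing Pattern and Matcher code unchanged. A comment that opens in the middle of a line, or is followed by code on its closing line, would not be handled by this approach.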

Upvotes: 1

CosmicGiant

Reputation: 6439

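The code doesn't work because the Scanner hands the regex one line at a time via nextLine(), so the pattern never sees the opening and closing braces of a multi-line comment in the same input string. Java's Pattern (used for regex) does have MULTILINE mode disabled by default, but that mode only changes where ^ and $ match, so enabling it with (?m) or through the Pattern's flags will not help here.

Try reading the whole file into a single string and running the matcher over that instead.

AFAICS there is nothing (else) wrong with the \{[^\}]*\} regex, since a negated character class already matches line breaks. You could probably use \{.*?\} instead, which is slightly more readable, but it needs DOTALL, i.e. (?s), to span lines.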
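As a rough sketch of matching against the whole input rather than one line at a time: the class WholeInputLexer and its lex method are made up for illustration, and the token set is cut down to comments, two keywords, and identifiers to keep it short. In a real program the source string could come from something like new String(Files.readAllBytes(Paths.get(filename))). No flag is set, because [^\}] matches line breaks on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeInputLexer {
    // Reduced token set for illustration:
    // group 1 = comment, group 2 = keyword, group 3 = identifier.
    static final Pattern TOKEN = Pattern.compile(
            "(\\{[^\\}]*\\})|(\\bWHILE\\b|\\bENDWHILE\\b)|(\\b[a-z]+\\b)");

    // Scans the entire source as one string and returns "token<TAB>description" entries.
    static List<String> lex(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            if (m.group(1) != null) {
                // Replace line breaks inside the comment so it prints on one line.
                tokens.add(m.group(1).replaceAll("\\R", " ") + "\tComment");
            } else if (m.group(2) != null) {
                tokens.add(m.group(2) + "\tKeyword");
            } else {
                tokens.add(m.group(3) + "\tIdentifier");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "{This is\na multi-line\ncomment.}\nWHILE(x<10)\n    x:=x+2;\nENDWHILE\n";
        lex(source).forEach(System.out::println);
    }
}
```

Because the comment alternative consumes everything up to the closing brace, the words inside the comment are never rescanned as identifiers.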

Upvotes: 0
