sep_The_new_elixiR

Reputation: 33

How can I ignore comments in a string based on compiler design?

I want to ignore every comment, both { comments } and // comments. I have a pointer named peek that walks my string character by character. I know how to ignore newlines, tabs, and spaces, but I don't know how to ignore comments.

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    comment\n
}\n end
"""

tokens = []
tmp = ''

for i, peek in enumerate(string.lower()):
    if peek == ' ' or peek == '\n':
        tokens.append(tmp)
        # ignoring whitespace and comments
        if(len(tmp)>0): 
            print(tmp)

        tmp = ''
    
    else:
        tmp += peek

Here is my result:

begin
west
west
north//comment1
north
north
west
east
east
south
//
comment
west
{
comment
}
end

As you can see, spaces are ignored but comments aren't.

How can I get a result like below?

begin
west
west
north
north
north
west
east
east
south
end

Upvotes: 0

Views: 993

Answers (2)

Gaslight Deceive Subvert

Reputation: 20400

@furas's answer works, but to keep the line count correct across multi-line comments, define the token with sly's _ decorator instead of a plain string:

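# goes inside the Lexer subclass, replacing the plain MULTI_LINE_COMMENT string
@_(r'\{(.|\n)*\}')
def MULTI_LINE_COMMENT(self, t):
    # count the newlines inside the comment so later line numbers stay correct
    self.lineno += t.value.count('\n')
    return t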

Upvotes: 0

furas

Reputation: 142868

Simply use a global variable skip = False: set it to True when you reach {, set it back to False when you reach }, and run the rest of your if/else only when skip is False (if not skip):

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    comment\n
}\n end
"""

tokens = []
tmp = ''
skip = False

for i, peek in enumerate(string.lower()):

    if peek == '{':
        skip = True
    elif peek == '}':
        skip = False
    elif not skip:

        if peek == ' ' or peek == '\n':
            # ignoring whitespace and comments: keep only non-empty tokens
            if len(tmp) > 0:
                tokens.append(tmp)
                print(tmp)
            tmp = ''
        else:
            tmp += peek

Comments may also be nested, { { } }, like this:

{\n
    { comment1 }\n
    comment2\n
    { comment3 }\n
}\n

so it is better to use skip as a counter of the currently open { braces:

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    { comment1 }\n
    comment2\n
    { comment3 }\n
}\n end
"""

tokens = []
tmp = ''
skip = 0

for i, peek in enumerate(string.lower()):

    if peek == '{':
        skip += 1
    elif peek == '}':
        skip -= 1
    elif not skip:  # elif skip == 0:

        if peek == ' ' or peek == '\n':
            # ignoring whitespace and comments: keep only non-empty tokens
            if len(tmp) > 0:
                tokens.append(tmp)
                print(tmp)
            tmp = ''
        else:
            tmp += peek
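
The loops above skip { } blocks but still let // comments through. Here is a minimal sketch that handles both, assuming a // comment runs to the end of its line and that { } comments may nest. The function name tokenize_without_comments is made up for this illustration:

```python
def tokenize_without_comments(text):
    """Split text into lowercase words, skipping // and nested { } comments."""
    tokens = []
    tmp = ''
    depth = 0                 # number of currently open '{'
    in_line_comment = False   # True while inside a // comment
    text = text.lower()
    i = 0
    while i < len(text):
        peek = text[i]
        if in_line_comment:
            # a // comment ends at the next newline
            if peek == '\n':
                in_line_comment = False
        elif depth > 0:
            # inside a { } comment: only track nesting
            if peek == '{':
                depth += 1
            elif peek == '}':
                depth -= 1
        elif peek == '{' or peek.isspace() or text.startswith('//', i):
            # any of these ends the current word
            if tmp:
                tokens.append(tmp)
            tmp = ''
            if peek == '{':
                depth = 1
            elif peek == '/':
                in_line_comment = True
                i += 1        # skip the second '/'
        else:
            tmp += peek
        i += 1
    if tmp:                   # flush a word that ends at end of input
        tokens.append(tmp)
    return tokens


sample = ("  beGIn west   WEST north//comment1 \n"
          "north       north west East east south\n"
          "// comment west\n"
          "{\n    { comment1 }\n    comment2\n    { comment3 }\n}\n end\n")
print(tokenize_without_comments(sample))
```

Unlike the loops above, this also ends a word when a comment starts right after it, as in north//comment1.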

It might be better to turn everything into tokens first and filter out the comment tokens afterwards, but I did not pursue this idea.
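
For illustration, that tokenize-first, filter-later idea could look roughly like this with the standard re module. The names TOKEN_RE, lex_all and drop_comments are hypothetical, and the sketch assumes that a // sequence never appears inside a { } comment:

```python
import re

# line comment | single brace | word; every other character is skipped
TOKEN_RE = re.compile(
    r'(?P<LINE_COMMENT>//[^\n]*)|(?P<BRACE>[{}])|(?P<WORD>[A-Za-z_]\w*)'
)


def lex_all(text):
    """Return (kind, value) pairs for every token, comments included."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text.lower())]


def drop_comments(tokens):
    """Keep only the words that sit outside every { } block."""
    depth = 0
    kept = []
    for kind, value in tokens:
        if kind == 'BRACE':
            # an unmatched '}' is ignored instead of driving depth negative
            depth = depth + 1 if value == '{' else max(depth - 1, 0)
        elif kind == 'WORD' and depth == 0:
            kept.append(value)
        # LINE_COMMENT tokens are simply dropped
    return kept


sample = ("  beGIn west   WEST north//comment1 \n"
          "north       north west East east south\n"
          "// comment west\n"
          "{\n    { comment1 }\n    comment2\n    { comment3 }\n}\n end\n")
print(drop_comments(lex_all(sample)))
```

The first pass keeps the comments as ordinary tokens, so they stay available if you need them later. Only the second pass decides what to throw away.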


EDIT:

Here is a version using the Python module sly, which works similarly to the C/C++ tools lex/yacc.

I found the regex for MULTI_LINE_COMMENT in another tool for building parsers, lark:

syntax for multiline comments

from sly import Lexer

class MyLexer(Lexer):
    # Create it before defining the regexes for the tokens
    tokens = { NAME, ONE_LINE_COMMENT, MULTI_LINE_COMMENT }

    ignore = ' \t'

    # Tokens
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    ONE_LINE_COMMENT = r'//.*'
    MULTI_LINE_COMMENT = r'\{(.|\n)*\}'

    # Ignored pattern
    ignore_newline = r'\n+'

    # Extra action for newlines
    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')

    # Work with errors
    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

if __name__ == '__main__':
    
    text =  """  beGIn west   WEST north//comment1 
north       north west East east south
// comment west
{
    { comment1 }
    comment2
    { comment3 }
}
 end
"""
    
    lexer = MyLexer()
    tokens = lexer.tokenize(text)
    for item in tokens:
        print(item.type, ':', item.value)

Result:

NAME : beGIn
NAME : west
NAME : WEST
NAME : north
ONE_LINE_COMMENT : //comment1 
NAME : north
NAME : north
NAME : west
NAME : East
NAME : east
NAME : south
ONE_LINE_COMMENT : // comment west
MULTI_LINE_COMMENT : {
    { comment1 }
    comment2
    { comment3 }
}
NAME : end
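
One caveat: (.|\n)* is greedy, so when the input contains two separate comments the pattern runs from the first { to the last } and swallows any code in between. Here is a small check with the standard re module rather than sly (the sample text is made up):

```python
import re

text = "north { c1 } west { c2 } end"

greedy = re.compile(r'\{(.|\n)*\}')    # the pattern used in the lexer above
lazy = re.compile(r'\{(.|\n)*?\}')     # stops at the first closing brace

greedy_match = greedy.search(text).group()
lazy_matches = [m.group() for m in lazy.finditer(text)]

print(greedy_match)    # { c1 } west { c2 }
print(lazy_matches)    # ['{ c1 }', '{ c2 }']
```

The lazy form keeps west, but it stops at the first }, so it cannot handle nested comments. For those, a brace counter as in the loop version is still needed.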

Upvotes: 1
