Match comment, but not inside a string

Question

I’d like to match comments in Perl.

Double- or single-quoted strings are not strings if they are inside comments.
#s inside strings are not comments.

Here is an example; every string and comment needs to be captured so it can be highlighted later.

# this is a comment, should be matched.
# # "I am not a string" . 'because I am inside a comment'
my $string = " #I am not a comment, because I am quoted";
my $another_string = "I am a multiline string with # on
                      each line #, have fun!";
my $descap_string = "I am a \ escaped \" \"string"; # and some comments;
my $sescap_string = 'I am a \ escaped \' \'string'; # and some comments;
my $empty_d ="";
my $empty_s ='';

I tried a few things, but could not work out a solution that covers all the situations.

Miller · Accepted Answer

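To do this, you can rely on the fact that the regex engine scans the source from left to right: whichever construct starts first, whether a string or a comment, is matched and consumed first, so a # inside a string or a quote inside a comment is never considered on its own. Write a regular expression for each kind of quote and one for comments, and combine them as alternatives in a single regex.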

The following is a stub of what I'm talking about:

use strict;
use warnings;

my $dquo_re = qr{...};
my $squo_re = qr{...};
my $comment_re = qr{...};

my $src = do { local $/; <DATA> };   # slurp everything after __DATA__

while ($src =~ /($dquo_re)|($squo_re)|($comment_re)/g) {
    if (defined $1) {
        print "Double quote found: $1\n";
    } elsif (defined $2) {
        print "Single quote found: $2\n";
    } elsif (defined $3) {
        print "Comment found: $3\n";
    }
}

__DATA__
# this is a comment, should be matched.
# "I am not a string" . 'because I am inside a comment'
my $string = " #I am not a comment, because I am quoted";
my $another_string = "I am a multiline string with # on 
                      each line #, have fun!";

Update

Because you've shown your work and come up with your own solution, I will reveal 3 regular expressions that will match most cases of single and double quoted strings and comments.

my $dquo_re = qr{"(?:(?>[^"\\]+)|\\.)*"};
my $squo_re = qr{'(?:(?>[^'\\]+)|\\.)*'};
my $comment_re = qr{(?



Outputs:

Comment found: # this is a comment, should be matched.
Comment found: # "I am not a string" . 'because I am inside a comment'
Double quote found: " #I am not a comment, because I am quoted"
Double quote found: "I am a multiline string with # on
                      each line #, have fun!"
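
For reference, here is a minimal self-contained sketch of the single-pass approach. The string patterns are the usual escaped-quote form with the /s modifier added so that an escaped newline is also accepted. The comment pattern is cut off in the text above, so the `#[^\n]*` used here is only an assumed stand-in, not the answer's original pattern, and `scan_tokens` is a hypothetical helper name chosen for illustration.

```perl
use strict;
use warnings;

# Quoted strings: runs of ordinary characters, or a backslash followed
# by any character. /s lets "." also match an escaped newline.
my $dquo_re = qr{"(?:(?>[^"\\]+)|\\.)*"}s;
my $squo_re = qr{'(?:(?>[^'\\]+)|\\.)*'}s;

# ASSUMPTION: stand-in comment pattern ("#" up to the end of the line).
# The original answer's comment pattern is not recoverable from the text.
my $comment_re = qr{#[^\n]*};

# Hypothetical helper: returns [kind, text] pairs in source order.
sub scan_tokens {
    my ($src) = @_;
    my @tokens;
    while ($src =~ /($dquo_re)|($squo_re)|($comment_re)/g) {
        if    (defined $1) { push @tokens, [ dquote  => $1 ] }
        elsif (defined $2) { push @tokens, [ squote  => $2 ] }
        else               { push @tokens, [ comment => $3 ] }
    }
    return @tokens;
}

# Non-interpolating heredoc, so the sample is taken literally.
my $src = <<'END_SRC';
# a comment with "no string" in it
my $s = " #not a comment";  # trailing comment
my $e = 'it\'s # quoted';
END_SRC

printf "%-8s %s\n", @$_ for scan_tokens($src);
```

Because the stand-in treats every # outside a quoted string as the start of a comment, constructs such as `$#array` or `s#a#b#` would be misreported as comments. A lookbehind or a real parser is needed to handle those.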


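By the way, the most complete way to do this is with PPI, which parses the Perl source into a document of tokens instead of relying on regular expressions: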

use strict;
use warnings;

use PPI;

my $src = do { local $/; <DATA> };   # slurp everything after __DATA__

# Load a document
my $doc = PPI::Document->new( \$src );

my $matches = $doc->find(sub{
    grep {$_[1]->isa("PPI::Token::$_")} qw(Comment Quote)
});

for (@$matches) {
    if ($_->isa('PPI::Token::Comment')) {
        print "Comment: ", $_->content;
    } elsif ($_->isa('PPI::Token::Quote')) {
        print "Quote: ", $_->content, "\n";
    }
}

__DATA__
# this is a comment, should be matched.
# "I am not a string" . 'because I am inside a comment'
my $string = " #I am not a comment, because I am quoted";
my $another_string = "I am a multiline string with # on 
                      each line #, have fun!";
