Halle

Reputation: 3584

How can I match and modify C and C++ comments with Perl?

I have the task of (trying to do) a search and replace within a large codebase for a word suffix, only when it occurs within comments. All of the comments are of the /* or // type, but they are guaranteed to include most of the edge cases imaginable.

So I want to change this:

/* blah blah something__suffix blah */

to this:

/* blah blah something blah */

but I also want to change this:

// blah blah something__suffix blah 

to this:

// blah blah something blah 

And this:

/*
 * blah blah something__suffix blah 
 */

to this:

/*
 * blah blah something blah 
 */

And this:

/** 

// blah blah something__suffix blah 

*/

To this:

/** 

// blah blah something blah 

*/

ad nauseam (literally).

Initially I felt that this was a parser task, so I installed Coccinelle. It could indeed parse my comments, but it got stuck on my preprocessor macros, and the workarounds seemed too complex for a one-off task. So now I'm considering regex.

I haven't found much advice on doing really robust search and replace within C and C++ comments with regex (besides "you need a parser"), but I did notice that the Perl FAQ has a pretty well road-tested Perl script for removing comments in both of these styles, as follows:

$/ = undef;
$_ = <>;

s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;

print;

My question: how can I adapt this script so that, instead of stripping each comment, it searches the text identified as a comment for the suffix and removes the suffix, leaving the rest of the comment intact?

Upvotes: 2

Views: 330

Answers (2)

ikegami

Reputation: 386331

You need to do it in two steps because a single comment might contain more than one occurrence:

/* foo__suffix bar__suffix */

First, match the whole comment; then remove every __suffix inside it.

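s{
   \G
   (?:(?!/[*/]).)*                  # skip everything up to the next comment opener
   \K                               # drop the skipped text from the match
   (   /[*] (?:(?![*]/).)* [*]/     # block comment
   |   //   [^\n]*                  # line comment
   )
}{
   my $comment = $1;
   $comment =~ s/(?<=\w)__suffix//g;
   $comment
}xesg;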
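
As a minimal, self-contained sketch of the two steps, the substitution can be wrapped in a sub and run on a sample string. The sub name strip_suffix_in_comments and the sample text are assumptions made for illustration. The sketch uses /s so that . crosses newlines, /e so the replacement is evaluated as code, and /g so that every comment is visited.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the answer):
# returns a copy of $src with __suffix removed inside comments only.
sub strip_suffix_in_comments {
    my ($src) = @_;
    $src =~ s{
        \G
        (?:(?!/[*/]).)*                 # skip text up to the next comment opener
        \K                              # exclude the skipped text from the match
        (   /[*] (?:(?![*]/).)* [*]/    # step 1a: a whole block comment
        |   //   [^\n]*                 # step 1b: a whole line comment
        )
    }{
        my $comment = $1;
        $comment =~ s/(?<=\w)__suffix//g;   # step 2: edit only inside the comment
        $comment
    }xseg;
    return $src;
}

# Sample input: two occurrences in one comment, plus code that must not change.
my $sample = <<'END_C';
int foo__suffix = 1; /* foo__suffix bar__suffix */
int bar__suffix = 2; // baz__suffix
END_C

print strip_suffix_in_comments($sample);
```

On this sample the identifiers in the code keep their suffix while both comments lose theirs. For real files, the same sub could be applied to text read in slurp mode, as in the script from the question.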

Notes:

  • (?:(?!STRING).) is to (?:STRING) as [^CHAR] is to CHAR.

  • My solution will mess up if you have // or /* in a string literal.

  • If you're ok with removing instances of __suffix that aren't preceded by an identifier, you can remove the (?<=\w).

  • If you're using 5.14 or higher, you can simplify

    s{...}{
       my $comment = $1;
       $comment =~ s/(?<=\w)__suffix//g;
       $comment
    }xesg;
    

    to

    s{...}{
       $1 =~ s/(?<=\w)__suffix//rg
    }xesg;
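
The string-literal caveat in the notes could be worked around by also matching string and character literals and putting them back unchanged, which is the same idea the Perl FAQ regex in the question relies on. The following is only a rough sketch under stated assumptions: the sub name strip_suffix_skipping_literals is hypothetical, and it does not handle C++ raw strings, backslash-continued // comments, or stray apostrophes in code (digit separators such as 1'000, for example).

```perl
use 5.014;      # the /r modifier needs Perl 5.14 or later
use strict;
use warnings;

# Hypothetical helper name, chosen for this sketch only.
sub strip_suffix_skipping_literals {
    my ($src) = @_;
    $src =~ s{
        (   /[*] .*? [*]/               # 1: block comment
        |   //   [^\n]*                 # 1: line comment
        )
      |
        (   " (?: \\. | [^"\\] )* "     # 2: string literal, left untouched
        |   ' (?: \\. | [^'\\] )* '     # 2: character literal, left untouched
        )
    }{
        my ($comment, $literal) = ($1, $2);
        defined $comment
            ? $comment =~ s/(?<=\w)__suffix//gr   # edit comments only
            : $literal                            # put literals back as they were
    }xseg;
    return $src;
}

# Sample input: a comment opener inside a string, and quotes inside a comment.
my $sample = <<'END_C';
const char *s = "// not__suffix a comment"; // real__suffix comment
char c = '"'; /* it's a__suffix "quote__suffix */
END_C

print strip_suffix_skipping_literals($sample);
```

Because the regex engine always takes the leftmost match, a // or /* that sits inside a quoted literal is consumed as part of the literal, and a quote that sits inside a comment is consumed as part of the comment. This version needs no \G, since text that matches neither alternative is simply left in place by s///g.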
    

Upvotes: 1

simbabque

Reputation: 54373

I'm not sure if this is a good solution, but it works.

use strict; use warnings; use feature qw(say);
my @lines = (
qq~Example 1:
/* blah blah something__suffix blah */~,
qq~Example 2:
// blah blah something__suffix blah needs a newline at the end
~,
qq~Example 3:
/*
 * blah blah something__suffix blah 
 */~,
qq~Example 4:
/** 

// blah blah something__suffix blah 

*/~,
qq~Example 5 (string):
foobar '// blah blah something__suffix blah '~,
qq~Example 6:
public void main { return; } // this does__suffix nothing but needs newline
~,
);

foreach (@lines) {
  print "Before:\n$_\n";
  s!/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)!
  { if (defined $3) { $3 } else { (my $temp = ${^MATCH}) =~ s/__suffix//g; $temp;} } 
  !gsepx;

  print "After:\n$_\n\n";
}

It's probably not very efficient, but I don't think that is important for your job.

Upvotes: 1
