Logick

Reputation: 331

Delete all comments in a file using sed

How would you use sed to delete all comments (introduced by #) from a file, while leaving alone any '#' that appears inside a string?

This helped out a lot except for the string portion.

Upvotes: 13

Views: 23562

Answers (7)

mvdan

Reputation: 536

As you have pointed out, sed won't work well if any parts of a script look like comments but actually aren't. For example, you could find a # inside a string, or the rather common $# and ${#param}.

I wrote a shell formatter called shfmt, which has a feature to minify code. That includes removing comments, among other things:

$ cat foo.sh
echo $# # inline comment
# lone comment
echo '# this is not a comment'
[mvdan@carbon:12] [0] [/home/mvdan]
$ shfmt -mn foo.sh
echo $#
echo '# this is not a comment'

The parser and printer are Go packages, so if you'd like a custom solution, it should be fairly easy to write a 20-line Go program to remove comments in the exact way that you want.

Upvotes: 0

jwfearn

Reputation: 29597

To remove comment lines (lines whose first non-whitespace character is #) but not shebang lines (lines whose first characters are #!):

sed '/^[[:space:]]*#[^!]/d; /#$/d' file

The first argument to sed is a string containing a sed program consisting of two delete-line commands of the form /regex/d. Commands are separated by ;. The first command deletes comment lines but not shebang lines. The second command deletes any remaining empty comment lines. It does not handle trailing comments.

The last argument to sed is a file to use as input. In Bash, you can also operate on a string variable like this:

sed '/^[[:space:]]*#[^!]/d; /#$/d' <<< "${MYSTRING}"

Example:

# test.sh
S0=$(cat << HERE
#!/usr/bin/env bash
# comment
  # indented comment
echo 'FOO' # trailing comment
# last line is an empty, indented comment
  #
HERE
)
printf '\nBEFORE removal:\n\n%s\n\n' "${S0}"
S1=$(sed '/^[[:space:]]*#[^!]/d; /#$/d' <<< "${S0}")
printf '\nAFTER removal:\n\n%s\n\n' "${S1}"

Output:

$ bash test.sh

BEFORE removal:

#!/usr/bin/env bash
# comment
  # indented comment
echo 'FOO' # trailing comment
# last line is an empty, indented comment
  #


AFTER removal:

#!/usr/bin/env bash
echo 'FOO' # trailing comment
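
One caveat with the second command: /#$/d deletes every line that ends in #, which also catches code such as echo $#. Below is a minimal sketch of a variant that folds the empty-comment case into the first pattern instead. It assumes a sed that accepts -E (GNU and BSD sed both do), and the function name strip_comment_lines is made up for this illustration.

```shell
# Hypothetical helper: delete whole-line comments but keep shebang lines.
# After optional leading whitespace, '#' must be followed either by a
# character other than '!' or by the end of the line.
strip_comment_lines() {
  sed -E '/^[[:space:]]*#([^!]|$)/d' "$@"
}

# Demo: the shebang and the 'echo $#' line survive, the comments do not.
printf '%s\n' '#!/usr/bin/env bash' '# comment' '  #' 'echo $#' |
  strip_comment_lines
```

Like the original, this only removes whole-line comments; trailing comments are left alone.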

Upvotes: 3

Harshad Yeola

Reputation: 1190

sed 's:^#\(.*\)$:\1:g' filename

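Assuming each comment line starts with a single #, the command above strips that leading # from the line. Note that it keeps the rest of the line, so it uncomments the lines rather than deleting them.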

Upvotes: -1

tripleee

Reputation: 189317

Supposing "being in a string" means "occurs between a pair of quotes, either single or double", the question can be rephrased as "remove everything after the first unquoted #". You can define the quoted strings, in turn, as anything between two quotes, excepting backslashed quotes. As a minor refinement, replace the entire line with everything up through just before the first unquoted #.

So we get something like [^\"'#] for the trivial case -- a piece of string which is neither a comment sign, nor a backslash, nor an opening quote. Then we can accept a backslash followed by anything: \\. -- that's not a literal dot, that's a literal backslash, followed by a dot metacharacter which matches any character.

Then we can allow zero or more repetitions of a quoted string. In order to accept either single or double quotes, allow zero or more of each. A quoted string shall be defined as an opening quote, followed by zero or more of either a backslashed arbitrary character, or any character except the closing quote: "\(\\.\|[^\"]\)*" or similarly for single-quoted strings '\(\\.\|[^\']\)*'.

Piecing all of this together, your sed script could look something like this:

s/^\(\([^\"'#]*\|\\.\|"\(\\.\|[^\"]\)*"\|'\(\\.\|[^\']\)*'\)*\)#.*/\1/

But because the script needs to be quoted for the shell, and it contains both single and double quotes, we need one more complication. Recall that the shell glues adjacent quoted strings together: "foo"'bar' becomes foobar, with foo in double quotes and bar in single quotes. Thus you can include a single quote by putting it in double quotes adjacent to your single-quoted string: '"foo"'"'" is "foo" in single quotes next to ' in double quotes, yielding "foo"'; likewise, "' can be expressed as '"' adjacent to "'". And so a string containing both kinds of quotes, such as foo"'bar, can be quoted as 'foo"' adjacent to "'bar" or, perhaps more realistically for this case, as 'foo"' adjacent to "'" adjacent to another single-quoted string 'bar', yielding 'foo"'"'"'bar'.

sed 's/^\(\(\\.\|[^\#"'"'"']*\|"\(\\.\|[^\"]\)*"\|'"'"'\(\\.\|[^\'"'"']\)*'"'"'\)*\)#.*/\1/' file

This was tested on Linux; on other platforms, the sed dialect may be slightly different. For example, you may need to omit the backslashes before the grouping and alternation operators.
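
One way to sidestep the shell quoting altogether is to keep the expression in a sed script file. The following is only a sketch under a few assumptions: it uses extended regular expressions via sed -E (so the grouping and alternation operators carry no backslashes), it writes the script to a temporary file, and the names script_file and strip_unquoted_comments are invented for this example.

```shell
# Sketch: store the expression in a script file so the shell never has to
# parse its quotes. The quoted 'EOF' delimiter keeps the text verbatim.
script_file=$(mktemp)
trap 'rm -f "$script_file"' EXIT
cat > "$script_file" <<'EOF'
s/^(([^\#"']|\\.|"(\\.|[^\"])*"|'(\\.|[^\'])*')*)#.*/\1/
EOF

# Hypothetical helper; any arguments are passed to sed as input files.
strip_unquoted_comments() {
  sed -E -f "$script_file" "$@"
}

# Demo: only the trailing comment on the first line is removed.
printf '%s\n' 'echo "a # b" # trailing' "echo 'c # d'" |
  strip_unquoted_comments
```

The whitespace that preceded the removed comment stays on the line, and constructs such as $# are still treated as the start of a comment.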

Alas, if you may have multi-line quoted strings, this will not work; sed, by design, only examines one input line at a time. You could build a complex script which collects multiple lines into memory, but by then, switching to e.g. Perl starts to make a lot of sense.

Upvotes: 1

potong

Reputation: 58371

This might work for you (GNU sed):

sed '/#/!b;s/^/\n/;ta;:a;s/\n$//;t;s/\n\(\("[^"]*"\)\|\('\''[^'\'']*'\''\)\)/\1\n/;ta;s/\n\([^#]\)/\1\n/;ta;s/\n.*//' file
  • /#/!b if the line does not contain a # bail out
  • s/^/\n/ insert a unique marker (\n)
  • ta;:a jump to a loop label (resets the substitute true/false flag)
  • s/\n$//;t if marker at the end of the line, remove and bail out
  • s/\n\(\("[^"]*"\)\|\('\''[^'\'']*'\''\)\)/\1\n/;ta if the string following the marker is a quoted one, bump the marker forward of it and loop.
  • s/\n\([^#]\)/\1\n/;ta if the character following the marker is not a #, bump the marker forward of it and loop.
  • s/\n.*// the remainder of the line is comment, remove the marker and the rest of line.
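
The steps above can also be laid out as a commented sed script file, which avoids the '\'' quoting needed on the command line. This is a sketch that assumes GNU sed, as the one-liner does (it relies on \n in the replacement and on \|); the names marker_script and strip_with_marker are made up for this illustration.

```shell
# Sketch: the same commands, one per line, in a script file (GNU sed assumed).
marker_script=$(mktemp)
cat > "$marker_script" <<'EOF'
# Lines without a # need no work.
/#/!b
# Insert a newline at the start of the line as a unique marker.
s/^/\n/
# Reset the substitution flag, then enter the loop.
ta
:a
# Marker reached the end of the line: drop it and finish.
s/\n$//
t
# Marker sits before a quoted string: move it past the string.
s/\n\(\("[^"]*"\)\|\('[^']*'\)\)/\1\n/
ta
# Marker sits before any character other than #: move it one step.
s/\n\([^#]\)/\1\n/
ta
# Marker sits before an unquoted #: delete it and the rest of the line.
s/\n.*//
EOF

# Hypothetical helper; any arguments are passed to sed as input files.
strip_with_marker() {
  sed -f "$marker_script" "$@"
}

printf '%s\n' 'echo "a # b" # trailing' 'x=1' | strip_with_marker
rm -f "$marker_script"
```

Whitespace in front of a removed comment is kept, so stripped lines may end in trailing spaces.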

Upvotes: 7

livibetter

Reputation: 20450

Since the asker provided no sample input, I will assume a couple of cases, and I will assume the input file is a Bash script because the question is tagged bash.

Case 1: entire line is the comment

The following should be sufficient in most cases:

sed '/^\s*#/d' file

It matches any line that has zero or more leading white-space characters (space, tab, or a few others, see man isspace) followed by a #, and then deletes the line with the d command.

Any lines like:

# comment started from beginning.
         # any number of white-space character before
    # or 'quote' in "here"

They will be deleted.

But

a="foobar in #comment"

will not be deleted, which is the desired result.

Case 2: comment after actual code

For example:

if [[ $foo == "#bar" ]]; then # comment here

The comment part can be removed by

sed "s/\s*#[^\"']*$//" file

[^\"'] is used to prevent quoted string confusion; however, it also means that comments containing a quotation mark (' or ") will not be removed.

Final sed

sed "/^\s*#/d;s/\s*#[^\"']*$//" file
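
To see what the final command does and does not catch, here is a small sketch. It spells \s as [[:space:]] so that it does not depend on GNU sed, which is an assumption about portability rather than part of the answer, and the function name strip_simple is invented for this example.

```shell
# Hypothetical helper: the two commands from above, with \s written as
# [[:space:]] for seds that lack the \s shorthand.
strip_simple() {
  sed -e '/^[[:space:]]*#/d' -e "s/[[:space:]]*#[^\"']*\$//" "$@"
}

# Line 1 is deleted, line 2 loses its trailing comment, line 3 is untouched
# because the # sits inside a string, and line 4 keeps its comment because
# the comment contains a quote character.
printf '%s\n' \
  '# full-line comment' \
  'if [[ $foo == "#bar" ]]; then # comment here' \
  'a="foobar in #comment"' \
  "echo hi # it's kept" |
  strip_simple
```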

Upvotes: 5

beatgammit

Reputation: 20205

If # always means comment, and can appear anywhere on a line (like after some code):

sed 's:#.*$::g' <file-name>

If you want to change it in place, add the -i switch:

sed -i 's:#.*$::g' <file-name>

This will delete from any # to the end of the line, ignoring any context. If you use # anywhere where it's not a comment (like in a string), it will delete that too.

If comments can only start at the beginning of a line, do something like this:

sed 's:^#.*$::g' <file-name>

If they may be preceded by whitespace, but nothing else, do:

sed 's:^\s*#.*$::g' <file-name>

These two will be a little safer because they likely won't delete valid usage of # in your code, such as in strings.
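
One detail worth knowing: the s commands above empty the matching lines but leave blank lines behind. If the lines should disappear entirely, the d command is an alternative. The sketch below contrasts the two; the function names are made up, and [[:space:]] is used as the portable spelling of \s.

```shell
# Hypothetical helpers for comparison.
blank_comment_lines()  { sed 's:^[[:space:]]*#.*$::' "$@"; }
delete_comment_lines() { sed '/^[[:space:]]*#/d' "$@"; }

# The first call leaves an empty line where the comment was;
# the second removes the line altogether.
printf '%s\n' '# comment' 'echo hi' | blank_comment_lines
printf '%s\n' '# comment' 'echo hi' | delete_comment_lines
```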

Edit:

There's not really a nice way of detecting whether something is in a string. I'd use the last two if that would satisfy the constraints of your language.

The problem with detecting whether you're in a string is that regular expressions can't do everything. There are a few problems:

  • Strings can likely span lines
  • A regular expression can't tell the difference between apostrophes and single quotes
  • A regular expression can't match nested quotes (these cases will confuse the regex):

    # "hello there"
    # hello there"
    "# hello there"
    

If double quotes are the only way strings are defined, double quotes will never appear in a comment, and strings cannot span multiple lines, try something like this:

sed 's:#[^"]*$::g' <file-name>

That's a lot of pre-conditions, but if they all hold, you're in business. Otherwise, I'm afraid you're SOL, and you'd be better off writing it in something like Python, where you can do more advanced logic.

Upvotes: 15
