Petro

Reputation: 23

Replace a pattern between lines

I am trying to replace a pattern between the lines of a file.

Specifically, I would like to replace ,\n & with , &\n in multiple large files. This effectively moves the & symbol to the end of the previous line. It is very easy with Ctrl+H, but I found it difficult with sed.

So, the initial file is in the following form:

      A +,
   &  B -,
   &  C ),
   &  D +,
   &  E (,
   &  F *,
 # &  G -,
   &  H +,
   &  I (,
   &  J +,
      K ?,

The desired output is:

      A +, &
      B -, &
      C ), &
      D +, &
      E (, &
      F *, &
#  &  G -,
      H +, &
      I (, &
      J +,
      K ?,

Following previously answered questions on Stack Overflow, I tried to convert it with the commands below:

sed ':a;N;$!ba;s/,\n &/&\n /g' file1.txt > file2.txt

sed -i -e '$!N;/&/b1' -e 'P;D' -e:1 -e 's/\n[[:space:]]*/ /' file2.txt

but they fail if the symbol "#" is present in the file.

Is there a simpler way to replace the matched pattern, something like: sed -i 's/,\n &/, &\n /g' file

Thank you in advance!

Upvotes: 2

Views: 145

Answers (5)

potong

Reputation: 58420

This might work for you (GNU sed):

sed -E '/,$/{:a;N;/#[^\n]*$/ba
        s/,((\n.*)*)\n(\s*)&/, \&\1\n\3 /;h;s/(.*)\n.*/\1/p;g;s/.*\n(.*\n)/\1/;D}' file

Form a two line window (but include comments too if necessary).

Format the first line and print it (with comments if found).

Remove all but the last two lines.

Delete the first of the two lines left and repeat.
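The two-line window idea can be seen in isolation in a much simpler form. The following is only a sketch of the N/P/D idiom with GNU sed, not the command above: it ignores comment lines entirely, so it does not solve the # case. The function name move_amp_simple is made up for illustration.

```shell
# Hypothetical helper: reads stdin, writes stdout.
# GNU sed is assumed (for \n in the replacement text).
move_amp_simple() {
  # $!N  append the next line unless this is the last one (two-line window)
  # s    move a leading & on the second line to the end of the first line
  # P;D  print and delete the first line, then restart with the remainder
  sed '$!N;s/,\n\( *\)&/, \&\n\1 /;P;D'
}

printf '      A +,\n   &  B -,\n      K ?,\n' | move_amp_simple
```

Because D restarts the cycle without reading new input, each line is first the second line of one window and then the first line of the next.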

Upvotes: 1

Bodo

Reputation: 9855

Assuming that the line

 # &  G -,

is a commented line which could get uncommented later, it might make sense to handle the & in this line as well. Not knowing the purpose of the data, this might or might not be useful.

With GNU Awk, the command

awk 'BEGIN { RS=",";ORS="" } { printf "%s%s", ORS, gensub(/(\n[ \t#]*)&/, " \\&\\1 ",1); ORS=RS }' inputfile

will turn the input

      A +,
   &  B -,
   &  C ),
   &  D +,
   &  E (,
   &  F *,
 # &  G -,
   &  H +,
   &  I (,
   &  J +,
      K ?,

into

      A +, &
      B -, &
      C ), &
      D +, &
      E (, &
      F *, &
 #    G -, &
      H +, &
      I (, &
      J +,
      K ?,

This script will only work correctly if the last line is terminated by a newline or if any other character follows the final ,.

Explanation:

  • RS="," sets the comma as record separator instead of a newline for input.
  • ORS="" sets the output record separator to an empty string before the first record.
  • printf "%s%s", ORS, gensub(...) prepends the record separator instead of appending it.
  • gensub is a GNU-specific substitution function that allows backreferences to matched groups.
  • /(\n[ \t#]*)&/ search pattern: The parentheses define a group (1) that consists of a newline \n followed by any sequence of spaces, tabs or comment characters [ \t#]*. The group is followed by an & character.
  • " \\&\\1 " replacement: space followed by &, followed by captured group (1) (\\1) and an additional space to replace the removed &. (The \\& is necessary to get a literal & character instead of inserting the whole match.)
  • ORS=RS sets the output record separator to , after every record, so that a comma is prepended before the 2nd and following records. This ensures that the last record, which should be a newline, will not get a trailing ,.

The version below of the GNU Awk script will work as expected only if the last line of the input file is not terminated with a newline. Otherwise it will create an additional line with a , because the last record, which contains only a newline, will be terminated by the output record separator ,.

awk 'BEGIN { RS=ORS="," } { print gensub(/(\n[ \t#]*)&/, " \\&\\1 ",1) }' inputfile

If the input file ends with a newline, the output will be

...
      I (, &
      J +,
      K ?,
,

with no newline after the last ,.
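The difference between prepending and appending the separator can be checked on a tiny input. This is a minimal sketch that leaves out gensub, so it should not need GNU Awk; the function names prepend_sep and append_sep are invented for the example.

```shell
# Hypothetical helpers: read stdin, write stdout.

# Prepend the separator: nothing before the first record, a comma before the rest.
prepend_sep() {
  awk 'BEGIN { RS = ","; ORS = "" } { printf "%s%s", ORS, $0; ORS = RS }'
}

# Append the separator: every record, including the last, is followed by a comma.
append_sep() {
  awk 'BEGIN { RS = ORS = "," } { print }'
}

printf 'a,b,c\n' | prepend_sep   # reproduces the input unchanged
printf 'a,b,c\n' | append_sep    # adds a stray , after the final newline
```

With RS set to a comma, the last record is "c" plus the trailing newline, which is why the appended separator ends up on a line of its own.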

Upvotes: 2

The fourth bird

Reputation: 163342

Using sed

sed -En 'H;${g;s/^\n//;s/((\n *#.*)*)\n +&(.*)/ \&\1\n    \3/gmp}' file

Explanation

  • -E Enable extended regexp
  • -n Prevent the default printing of sed
  • H Append to hold space
  • ${ When on the last line
  • g Replace the pattern space with the contents of the hold space
  • s/^\n//; Remove the leading newline from the pattern space
  • s/ Start substitute
  • ((\n *#.*)*) Capture group 1, optionally repeat matching a newline and # followed by the rest of the line
  • \n +&(.*) Match a newline and 1+ spaces, then match & and capture the rest of the line in group 3
  • / Substitute with after this
  • \&\1\n \3 The substitution pattern with the capture groups and the escaped &
  • / End substitution
  • gmp g replaces all occurrences, m enables multiline mode, p prints the pattern space if a substitution was made
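The H;${g;...} part is a common way to collect the whole file before substituting. A minimal sketch of just that step, assuming GNU sed; the function name join_lines is hypothetical, and it joins the lines with | instead of performing the real substitution.

```shell
# Hypothetical helper: reads stdin, writes stdout.
join_lines() {
  # H        append every line to the hold space (preceded by a newline)
  # ${...}   on the last line only:
  #   g        copy the hold space into the pattern space
  #   s/^\n//  drop the newline that H put before the first line
  #   s/\n/|/g make the collected newlines visible
  #   p        print the result (needed because of -n)
  sed -n 'H;${g;s/^\n//;s/\n/|/g;p;}'
}

printf 'a\nb\nc\n' | join_lines   # prints a|b|c
```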

Output

      A +, &
      B -, &
      C ), &
      D +, &
      E (, &
      F *, &
 # &  G -,
      H +, &
      I (, &
      J +,
      K ?,

See a bash demo.

Upvotes: 1

Renaud Pacalet

Reputation: 29050

If you use GNU sed and your file does not contain NUL characters (ASCII code 0), you can use its -z option to process the whole file as one single string, and the multi-line mode of the substitute command (m flag). The m flag is not strictly needed, but it simplifies things a bit (. and character classes do not match newlines):

$ sed -Ez ':a;s/((\`|\n)[^#]*,)((\n.*#.*)*)(\n[[:blank:]]*)&/\1 \&\3\5 /gm;ta' file
      A +, &
      B -, &
      C ), &
      D +, &
      E (, &
      F *, &
 # &  G -,
      H +, &
      I (, &
      J +,
      K ?,

This corresponds to your textual specification and to your desired output for the example you show. But it is a bit complicated. Instead of processing lines that end with a newline character it processes sub-strings that begin with a newline character (or the beginning of the file) and end before the next newline character. Let's name these "chunks".

We search for a sequence of chunks in the form AB*C where:

  • A is a chunk (possibly the first) not containing #. It is matched by (\`|\n)[^#]*, which means beginning-of-file-or-newline, followed by any number of characters except newline and #, followed by a comma.
  • B* is any number (including none) of chunks containing #. It is matched by \n.*#.* which means newline, followed by any number of characters except newline, followed by # and any number of characters except newline.
  • C is a chunk starting with a newline, followed by spaces and &. It is matched by \n[[:blank:]]*& which means newline, followed by any number of blanks and a &.

If we find such an AB*C sequence we add a space and a & at the end of A, we do not change B*, and we replace the first & in C by a space. And we repeat until no such sequence is found.
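The same AB*C rule can also be written line by line. The sketch below uses a different technique from the sed command above (plain awk with a one-line buffer), and rests on two assumptions: any line containing # counts as a comment, and a & is only moved when the previous non-comment line ends with a comma (optionally followed by blanks). The function name move_amp_awk is invented for the example.

```shell
# Hypothetical helper: reads stdin, writes stdout.
move_amp_awk() {
  awk '
    # B: buffer comment lines so they can be re-emitted unchanged.
    /#/ { held = held $0 "\n"; next }
    {
      if (have_prev) {
        # C: the current line starts with blanks and &; A: prev ends with a comma.
        if (prev ~ /,[ \t]*$/ && $0 ~ /^[ \t]*&/) {
          prev = prev " &"   # add the & to the end of A
          sub(/&/, " ")      # replace the first & in C by a space
        }
        print prev
      }
      printf "%s", held      # comment lines seen since the previous code line
      prev = $0; held = ""; have_prev = 1
    }
    END {
      if (have_prev) print prev
      printf "%s", held
    }
  '
}

printf '  A,\n & B,\n # & G,\n & H,\n  K,\n' | move_amp_awk
```

Keeping only one pending line avoids loading the whole file, which may matter for the large files mentioned in the question.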

Note: if the commas can be followed by blanks before the newline we must take them into account. If you want to keep them:

$ sed -Ez ':a;s/((\`|\n)[^#]*,[[:blank:]]*)((\n.*#.*)*)(\n[[:blank:]]*)&/\1 \&\3\5 /gm;ta' file

Else:

$ sed -Ez ':a;s/((\`|\n)[^#]*,)[[:blank:]]*((\n.*#.*)*)(\n[[:blank:]]*)&/\1 \&\3\5 /gm;ta' file

Upvotes: 1

sseLtaH

Reputation: 11227

Using sed

$ sed ':a;N;s/\n \+\(&\) \(.*\)/ \1\n     \2/;ba' input_file
      A +, &
      B -, &
      C ), &
      D +, &
      E (, &
      F *,
 # &  G -, &
      H +, &
      I (, &
      J +,

Upvotes: 2
