Gert Gottschalk

Reputation: 439

Replacing a multi-character pattern that includes a newline with some characters

I have a file in which some lines are wrapped onto continuation lines that start with "+ ", and I need to unwrap them onto the preceding line.

Example:

X123
+ a b c
+ d e f g
Y4567
+ a1 b2
+ c1 d2
+ e1 f2

Expected:

X123 a b c d e f g
Y4567 a1 b2 c1 d2 e1 f2

I tried: perl -00pe 's/\n\+ / /g'

But it failed:

Substitution loop at -e line 1, <> chunk 1.
11.715u 18.455s 0:33.14 91.0%   0+0k 13426056+0io 155pf+0w

Upvotes: 10

Views: 702

Answers (5)

user3408541

Reputation: 71

The problem with this solution

perl -00pe 's/\n\+ / /g'

was simply that you didn't include the input file on the command line. Create an input file like this:

$ more combine.lines.txt
X123
+ a b c
+ d e f g
Y4567
+ a1 b2
+ c1 d2
+ e1 f2
here is a test
+ this is
+ a test
here is another test
+ this is
+ another
+ test
original line with no plus lines

Then run

$ perl -00pe 's/\n\+ / /g' combine.lines.txt
X123 a b c d e f g
Y4567 a1 b2 c1 d2 e1 f2
here is a test this is a test
here is another test this is another test
original line with no plus lines

It looks to be working. However, I get the feeling that you didn't write that code yourself and don't really understand what it does. It is a slightly tricky solution that works by removing the \n and the + that follows it. Because of the file structure, the + lines end up appended to the non-plus line that precedes them.

It may be useful to see this done manually. I have done the same thing reading the file line by line. I appended each + line to the preceding original line. Here is the code...

#!/usr/bin/perl -w

my $combinedLine = "";  # buffer holding the original line plus its + lines
while (<>) {
  chomp;  # remove the newline
  if (/^[^+]/) {
    # A non-plus line starts a new record: print a newline followed by
    # the buffered line (if there is one), then start a new buffer.
    print "\n$combinedLine" if $combinedLine =~ /./;
    $combinedLine = $_;
  } else {                # plus line found
    s/^\+//;              # remove the plus (the space after it is kept)
    $combinedLine .= $_;  # append to the buffered line
  }
}
print "\n$combinedLine";  # print the last buffered line

Output is the same as above...

$ perl combine.lines.pl combine.lines.txt

X123 a b c d e f g
Y4567 a1 b2 c1 d2 e1 f2
here is a test this is a test
here is another test this is another test
original line with no plus lines

Upvotes: -1

ikegami

Reputation: 386541

You operated on a string that is more than 2^31 chars in length, which is longer than the regex engine can handle. To handle strings that long, upgrade to Perl 5.22 or higher.

perl5220delta:

s///g now works on very long strings (where there are more than 2 billion iterations) instead of dying with 'Substitution loop'. [GH #11742]. [GH #14190].

Alternatively, you could mess with the line terminator.

perl -pe'BEGIN { $/ = "\n+ " } s/\n\+ \z/ /'

If only a few lines start with that sequence, a single record can still be extremely long, so this risks not fixing the problem. Instead, you could use a solution which doesn't read more than one line at a time.

perl -ne'
   chomp;
   print "\n" if !s/^\+ / / && $. != 1;
   print;
   END { print "\n"; }
'

Same idea, but shortened at the cost of readability:

perl -pe'print $l if !s/^\+ / /; $l = chop; END { print $l }'
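
For readers who find the one-liners hard to follow, here is a minimal sketch of the same line-at-a-time idea in Python (not Perl, and not part of the original answer). The function name unwrap_lines is made up for illustration. One assumption differs from the Perl versions: a "+ " line with nothing before it is kept as an ordinary line.

```python
import io


def unwrap_lines(lines):
    """Yield logical lines, joining '+ ' continuation lines onto the previous one.

    Only one logical line is buffered at a time, so memory use depends on
    the longest unwrapped line, not on the size of the whole input.
    """
    current = None  # the logical line being assembled
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("+ ") and current is not None:
            # Continuation: drop the '+' and keep the space as the separator.
            current += line[1:]
        else:
            if current is not None:
                yield current
            current = line
    if current is not None:
        yield current


sample = io.StringIO(
    "X123\n+ a b c\n+ d e f g\nY4567\n+ a1 b2\n+ c1 d2\n+ e1 f2\n"
)
for logical in unwrap_lines(sample):
    print(logical)
```

Because the generator accepts any iterable of lines, it can be fed an open file object directly, which reads the file lazily instead of loading it whole.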

Upvotes: 11

jhnc

Reputation: 16819

Borrowing @TLP's solution, with gawk which allows RS to be a regex (standard awk doesn't):

gawk 1 RS='\n[+]' ORS= file

As @ikegami notes, this may not do the right thing if you have input like:

X123
+ a b c
+d e f g

that should become

X123 a b c
+d e f g
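
The difference between the two patterns can be seen in a small Python sketch (Python only for illustration). The function names join_loose and join_strict are hypothetical, and join_loose merely mimics what a "\n+" separator with an empty output separator does to the text.

```python
def join_loose(text):
    """Delete every newline-plus pair, whatever follows the plus."""
    return text.replace("\n+", "")


def join_strict(text):
    """Replace newline-plus-space with a single space, leaving '+d' lines alone."""
    return text.replace("\n+ ", " ")


sample = "X123\n+ a b c\n+d e f g\n"

print(repr(join_loose(sample)))   # the '+d' line is glued on: 'X123 a b cd e f g\n'
print(repr(join_strict(sample)))  # the '+d' line survives: 'X123 a b c\n+d e f g\n'
```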

Upvotes: 4

dawg

Reputation: 104082

Given your input example, here is an awk:

awk '/^[^+]/{if (s) print s; s=$0; next} 
            {sub(/^\+/,""); s=s $0} 
     END{print s}' file

Or another awk:

awk 'sub(/^\+/,"")==0 && FNR>1 {print ""} {printf "%s", $0} END{print ""}' file

Or a Ruby:

ruby -ne 'chomp
puts if !$_.sub!(/^\+\s*/," ") && $. > 1
print $_ + ($<.eof? ? "\n" : "")' file

Any of those prints:

X123 a b c d e f g
Y4567 a1 b2 c1 d2 e1 f2

Upvotes: 5

TLP

Reputation: 67910

If you want a line-by-line version, you could change the input record separator to \n+ and then remove that with chomp. It would in effect just delete those characters from the file with a normal -p one-liner. I.e.:

$ perl -pe'BEGIN{$/="\n+"}; chomp;' file.txt

The process is that it reads a "line" that ends with newline and a plus and puts that in $_, then chomp removes that ending, and the line is printed.
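
To make the mechanism concrete, here is a rough Python sketch of reading records that end in a custom separator (again Python rather than Perl, purely as an illustration). The name read_records and the chunk_size parameter are assumptions of this sketch and are not Perl features.

```python
import io


def read_records(stream, sep="\n+", chunk_size=4096):
    """Yield records from stream that are terminated by sep, with sep removed."""
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Emit every complete record currently in the buffer. A separator
        # split across two chunks is found once the second chunk arrives.
        while True:
            head, found, tail = buf.partition(sep)
            if not found:
                break
            yield head
            buf = tail
    if buf:
        yield buf  # final record, which has no trailing separator


sample = io.StringIO("X123\n+ a b c\n+ d e f g\nY4567\n+ a1 b2\n")
# Printing the records back to back deletes every "\n+" from the text.
print("".join(read_records(sample, chunk_size=5)), end="")
```

Note that a record here is everything up to the next newline-plus, so it can span several ordinary lines. The buffer therefore grows with the distance between separators, not with the length of a single line.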

Upvotes: 6
