Reputation: 5146
I want to remove (with sed or awk) the newline at the end of every line that contains exactly one " character. Once a line has been joined with the line after it, the newline at the end of that following line must be kept.
This is an example:
line1"test 2015"
line2"test
2015"
line3"test 2020"
line4"test
2017"
It should be transformed into:
line1"test 2015"
line2"test2015"
line3"test 2020"
line4"test2017"
Upvotes: 1
Views: 1738
Reputation: 88563
With sed:
sed '/[^"]$/{N;s/\n//}' file
Output:
line1"test 2015"
line2"test2015"
line3"test 2020"
line4"test2017"
Search (//) for lines that do not (^) end ($) with the single character ". Only for these lines ({}): append the next line (N) to sed's pattern space (the current line) and use sed's search and replace (s///) to find the now embedded newline (\n) in the pattern space and replace it with nothing.
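To try this against the sample from the question without creating a file, something like the following sketch should do. The helper names sample and join_once are made up for illustration, and the ';' before '}' is an addition so that non-GNU sed also accepts the script.

```shell
# Sample input from the question, produced with printf so the sketch is self-contained.
sample() {
  printf '%s\n' 'line1"test 2015"' 'line2"test' '2015"' \
                'line3"test 2020"' 'line4"test' '2017"'
}

# join_once is a hypothetical helper name, not part of the answer.
# Lines that do not end in a double quote get the next line appended (N),
# then the embedded newline is removed.
join_once() {
  sed '/[^"]$/{N;s/\n//;}'
}

sample | join_once
```

Note that this joins at most one following line per match, so it relies on a quoted value being split over no more than two lines.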
Upvotes: 3
Reputation: 58371
This might work for you (GNU sed):
sed -r ':a;N;s/^([^\n"]*"[^\n"]*)\n/\1 /;ta;P;D' file
This replaces the newline between two lines with a space when the first line contains exactly one double quote.
N.B. The space may be removed as well, but the data suggested it.
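If the joined parts should run together exactly as in the question's expected output, a variant without the space in the replacement might look like this. It is only a sketch: it assumes GNU sed (for -r, for \n inside a bracket expression, and for N printing the pattern space at end of input), and join_no_space is a made-up name.

```shell
# GNU sed assumed; join_no_space is a hypothetical helper name for this sketch.
# Same loop as above, but the newline is replaced by nothing instead of a space.
join_no_space() {
  sed -r ':a;N;s/^([^\n"]*"[^\n"]*)\n/\1/;ta;P;D'
}

printf '%s\n' 'line1"test 2015"' 'line2"test' '2015"' \
              'line3"test 2020"' 'line4"test' '2017"' | join_no_space
```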
Upvotes: 0
Reputation: 44023
awk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, ""); } { printf("%s%s", $0, RT) }' filename
This is the most straightforward way. Using " as the record separator,
NR % 2 == 0 {                # in every other record (those inside quotes)
  gsub(/\n/, "")             # remove the newlines
}
{
  printf("%s%s", $0, RT)     # then print the line terminated by the same thing
                             # as in the input (to avoid an extra quote at the
                             # end of the output)
}
RT is a GNU extension; that is why this requires gawk.
The difficulty in doing this with sed is the possibility that there may be two newlines between quotes, such as
line2"test
123
2015"
This makes fetching just a single line after the condition brittle. Therefore:
sed '/^[^"]*"[^"]*$/ { :a /\n.*"/! { N; ba; }; s/\n//g; }' filename
That is:
/^[^"]*"[^"]*$/ {   # When a line contains only one quote
  :a                # jump label for looping
  /\n.*"/! {        # until there appears another quote
    N               # fetch more lines
    ba
  }
  s/\n//g           # once done, remove the newlines.
}
As a one-liner, this requires GNU sed because BSD sed is picky about the formatting of branch instructions. It should, however, be possible to put the expanded form of the code into a file, say foo.sed, and run sed -f foo.sed filename with BSD sed.
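A minimal sketch of that file-based form follows. The temporary directory and the input file name are assumptions made for the example. The trailing comments of the expanded listing are left out of the script file, since some sed implementations read a label up to the end of the line.

```shell
# Scratch directory for the sketch; foo.sed follows the answer, the rest is assumed.
dir=$(mktemp -d)

# One command per line, no trailing comments.
cat > "$dir/foo.sed" <<'EOF'
/^[^"]*"[^"]*$/ {
  :a
  /\n.*"/! {
    N
    ba
  }
  s/\n//g
}
EOF

# An ordinary line followed by the three-line quoted value from the example.
printf '%s\n' 'line1"test 2015"' 'line2"test' '123' '2015"' > "$dir/input.txt"

sed -f "$dir/foo.sed" "$dir/input.txt"
```

With this input the first line should pass through unchanged and the remaining three should come out as the single line line2"test1232015".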
Note that this code assumes that after an opening quote, the next line with a quote in it contains only that one quote. A way around that problem, if required, is
sed ':a h; s/[^"]//g; s/""//g; /"/ { x; N; s/\n//; ba }; x' filename
...but that is arguably beyond the scope of things that should reasonably be done with sed. It works like this:
:a            # jump label for looping
h             # make a copy of the line
s/[^"]//g     # isolate quotes
s/""//g       # remove pairs of quotes
/"/ {         # if there is a quote left (number of quotes is odd)
  x           # swap the unedited text back into the pattern space
  N           # fetch a new line
  s/\n//      # remove the newline between them
  ba          # loop
}
x             # swap the text back in before printing.
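To see the quote-parity loop handle a continuation line that itself contains several quotes, here is a small sketch. The script is split into separate -e expressions, which is an adaptation intended to keep it acceptable to sed implementations that are strict about labels and branches; join_balanced and the sample input are made up for illustration.

```shell
# join_balanced is a hypothetical helper name. Each -e chunk is one command,
# mirroring the annotated listing above.
join_balanced() {
  sed -e ':a' -e 'h' -e 's/[^"]//g' -e 's/""//g' \
      -e '/"/ {' -e 'x' -e 'N' -e 's/\n//' -e 'ba' -e '}' -e 'x'
}

# The second input line closes one quoted value and opens another.
printf '%s\n' 'line1"foo' 'bar" "baz' 'qux"' 'line2"ok"' | join_balanced
```

The first three input lines should be merged into line1"foobar" "bazqux", because the running quote count only becomes even after the third line; line2"ok" already has an even count and is printed as is.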
The case of several quotes per line is easier to handle in awk than in sed. The GNU awk code above does it implicitly; for non-GNU awk it takes a little more doing (but not terribly so):
awk -F '"' '{ n = 0; line = ""; do { n += NF != 0 ? NF - 1 : 0; line = line $0 } while(n % 2 == 1 && getline == 1); print line }' filename
The main trick is to use " as field separator so that the number of fields tells us how many quotes are in the line. Then:
{
  # reset state
  n = 0                                 # n is the number of quotes we have
                                        # seen so far
  line = ""                             # line is where we assemble the output
                                        # line
  do {
    n += NF != 0 ? NF - 1 : 0;          # add the number of quotes in the line
                                        # (special handling for empty lines
                                        # where NF == 0)
    line = line $0                      # append the line to the output
  } while(n % 2 == 1 && getline == 1)   # while the number of quotes is odd
                                        # and there's more input, get new lines
                                        # and loop
  print line                            # once done, print the combined result.
}
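The field-counting trick can be checked in isolation. The sketch below only prints the number of quotes per input line, using the same NF expression; count_quotes is a made-up helper name and the input is chosen to include an empty line.

```shell
# count_quotes is a hypothetical helper: with " as the field separator,
# a line with k quotes has k + 1 fields, and an empty line has NF == 0.
count_quotes() {
  awk -F '"' '{ print (NF != 0 ? NF - 1 : 0) }'
}

printf '%s\n' 'line1"test 2015"' 'line2"test' '' '2015"' | count_quotes
```

For these four lines the counts should be 2, 1, 0 and 1, which is why the loop keeps reading after line2"test and stops once 2015" brings the running total to an even number.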
Upvotes: 1