Reputation: 1538
I'm trying to sanitize double quotes (") in a text file by replacing them with \" or an empty string. The replacement must occur only in the substring delimited by TOKEN placeholders.
Example
# input (example.txt):
Line title "A" - TOKEN some line with "quotation marks"
Line title "B" - TOKEN some line with "another quotation marks"
Line title "C" - TOKEN some "line" TOKEN more "text"
Random "line"
# result (example.txt)
Line title "A" - TOKEN some line with \"quotation marks\"
Line title "B" - TOKEN some line with \"another quotation marks\"
Line title "C" - TOKEN some \"line\" TOKEN more "text"
Random "line"
# Another option
# result (example.txt)
Line title "A" - TOKEN some line with quotation marks
Line title "B" - TOKEN some line with another quotation marks
Line title "C" - TOKEN some line TOKEN more "text"
Random "line"
Preferably without external dependencies (e.g. Python, JS) on Linux, so sed, awk, or bash are probably best.
PS - What I've tried so far is:
sed -iE "s/TOKEN(.+)(\")(.+).*\TOKEN\1\3/g" /tmp/test
But it handles only a single replacement per line
EDIT:
(sorry about late addition after many answers)
Upvotes: 2
Views: 148
Reputation: 204258
Assuming that:
TOKEN can be treated as a regular expression (or, if not, you will escape its metacharacters before using it),
lines where TOKEN doesn't occur should be left unchanged, and
TOKEN should match even if it's in the middle of another string,
then, using any awk in any shell on every Unix box:
$ awk '
match($0,/TOKEN.*TOKEN/) || match($0,/TOKEN.*/) {
    tgt = substr($0,RSTART,RLENGTH)
    gsub(/"/, "\\\"", tgt)
    $0 = substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
1' example.txt
Line title "A" - TOKEN some line with \"quotation marks\"
Line title "B" - TOKEN some line with \"another quotation marks\"
Line title "C" - TOKEN some \"line\" TOKEN more "text"
Random "line"
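For the second option in the question (dropping the quotes rather than escaping them), the same script should work with an empty replacement string in gsub(). A minimal sketch; the function name strip_quotes_in_token_span is only a hypothetical wrapper added for illustration:

```shell
# Hypothetical wrapper name; the awk body is the script above with the
# gsub() replacement changed from \" to an empty string.
strip_quotes_in_token_span() {
    awk '
    match($0, /TOKEN.*TOKEN/) || match($0, /TOKEN.*/) {
        tgt = substr($0, RSTART, RLENGTH)   # the TOKEN-delimited span
        gsub(/"/, "", tgt)                  # remove quotes inside the span only
        $0 = substr($0, 1, RSTART-1) tgt substr($0, RSTART+RLENGTH)
    }
    1' "$@"
}

printf '%s\n' 'Line title "C" - TOKEN some "line" TOKEN more "text"' 'Random "line"' |
    strip_quotes_in_token_span
# Line title "C" - TOKEN some line TOKEN more "text"
# Random "line"
```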
Upvotes: 3
Reputation: 15418
Divide and conquer in sed:
$: sed '/TOKEN/{ h; s/TOKEN.*//; x; s/^.*TOKEN//; s/"/\\"/g; H; x; s/\n/TOKEN/; }' file
# result:
Line title "A" - TOKEN some line with \"quotation marks\"
Line title "B" - TOKEN some line with \"another quotation marks\"
Explanation:
/TOKEN/{ ...}
- this acts only on lines containing TOKEN
h;
- this places a copy of the line in the hold buffer
s/TOKEN.*//;
- this removes everything from TOKEN through the end of the line in the pattern buffer
x;
- this exchanges the pattern and hold buffers, placing the abbreviated beginning in the hold buffer and the complete record in the pattern buffer
s/^.*TOKEN//;
- this eliminates the part you do NOT want changed from the pattern buffer copy
s/"/\\"/g;
- this backslash-escapes the double-quote characters remaining in the pattern buffer copy; use s/"//g to just remove them
H;
- this appends a newline to the hold buffer, then adds the pattern buffer copy as another line in the hold buffer
x;
- this swaps the entire hold buffer back into the pattern buffer
s/\n/TOKEN/;
- this replaces the newline with TOKEN
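As a quick check of the steps above, here is a sketch of the removal variant, with s/"//g swapped in for the escaping substitution. The function name strip_after_token is hypothetical, and the sample assumes a single TOKEN per line, as in the output shown earlier:

```shell
# Hypothetical wrapper name; same hold-buffer script, but deleting the quotes.
strip_after_token() {
    sed '/TOKEN/{ h; s/TOKEN.*//; x; s/^.*TOKEN//; s/"//g; H; x; s/\n/TOKEN/; }' "$@"
}

printf '%s\n' 'Line title "A" - TOKEN some line with "quotation marks"' 'Random "line"' |
    strip_after_token
# Line title "A" - TOKEN some line with quotation marks
# Random "line"
```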
In English:
Upvotes: 1
Reputation: 23677
Here's another perl solution. This deletes all double quotes only if TOKEN doesn't occur later in the input line. Use perl -i -pe for in-place modification.
$ perl -pe 's/"(?!.*TOKEN)//g' ip.txt
Line title "A" - TOKEN some line with quotation marks
Line title "B" - TOKEN some line with another quotation marks
If there can be lines with double quotes that don't contain TOKEN, and such quotes shouldn't be changed, use perl -pe 's/"(?!.*TOKEN)//g if /TOKEN/'
Here's an awk solution. The input is split using TOKEN as the field separator, and then the substitution is performed on the second field. Lines not containing TOKEN won't be modified.
$ awk 'BEGIN{FS=OFS="TOKEN"} {gsub(/"/, "", $2)} 1' ip.txt
Line title "A" - TOKEN some line with quotation marks
Line title "B" - TOKEN some line with another quotation marks
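Since only the second field is touched, the same awk approach should also cover a line with two TOKENs, and escaping instead of deleting only needs a different gsub() replacement. A small sketch, with escape_second_field as a hypothetical wrapper name:

```shell
# Hypothetical wrapper name; same field-splitting idea, but the quotes in
# the second field are escaped as \" instead of being removed.
escape_second_field() {
    awk 'BEGIN { FS = OFS = "TOKEN" } { gsub(/"/, "\\\"", $2) } 1' "$@"
}

printf '%s\n' 'Line title "C" - TOKEN some "line" TOKEN more "text"' |
    escape_second_field
# Line title "C" - TOKEN some \"line\" TOKEN more "text"
```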
Upvotes: 2
Reputation: 242038
Perl to the rescue!
perl -pe 's/TOKEN\K(.*)/$1 =~ s|"|\\"|gr/e' -- example.txt
It's a substitution inside a substitution.
The outer substitution looks like this:
s/TOKEN\K(.*)/.../e
It replaces everything after TOKEN with the ... part. The /e means the ... part is evaluated as code.
The replacement code is $1 =~ s|"|\\"|gr. It substitutes every " with \" in the contents of $1, i.e. the part captured by the outer substitution, and returns the result (that's what the /r does).
To remove the double quotes instead of escaping them, just delete the \\" part.
Upvotes: 2
Reputation: 627219
Removing all double quotation marks after the TOKEN substring with sed can be done with
sed -i -E ':A; s/(TOKEN[^"]*)"/\1/g; tA' /tmp/test
Replacing " with \" after TOKEN is also possible:
sed -i -E ':A; s/(TOKEN[^\\"]*(\\.[^\\"]*)*)"/\1\\"/g; tA' /tmp/test
Details:
:A
- sets a label A
s/(TOKEN[^"]*)"/\1/g
- finds all occurrences of TOKEN followed by zero or more chars other than " (captured into Group 1) and then a ", and replaces the match with the Group 1 value (the version with [^\\"]*(\\.[^\\"]*)* matches all escaped chars together with any chars other than double quotation marks, and the \1\\" replacement puts back the Group 1 value + an escaped ")
tA
- goes back to label A upon successful replacement.
See the online demo:
#!/bin/bash
s='Line title "A" - TOKEN some line with "quotation marks"'
sed -E ':A; s/(TOKEN[^"]*)"/\1/g; tA' <<< "$s"
# => Line title "A" - TOKEN some line with quotation marks
sed -E ':A; s/(TOKEN[^\\"]*(\\.[^\\"]*)*)"/\1\\"/g; tA' <<< "$s"
# => Line title "A" - TOKEN some line with \"quotation marks\"
Upvotes: 3