Vano

Reputation: 1538

replace quotes in substring (preferably without external dependencies)

I'm trying to sanitize double quotes (") in a text file by replacing them with \" or with an empty string. The replacement must occur only in the substring delimited by TOKEN placeholders.

Example

# input (example.txt):
Line title "A" - TOKEN some line with "quotation marks"
Line title "B" - TOKEN some line with "another quotation marks"
Line title "C" - TOKEN some "line" TOKEN more "text"
Random "line"


# result (example.txt)
Line title "A" - TOKEN some line with \"quotation marks\"
Line title "B" - TOKEN some line with \"another quotation marks\"
Line title "C" - TOKEN some \"line\" TOKEN more "text"
Random "line"


# Another option
# result (example.txt)
Line title "A" - TOKEN some line with quotation marks
Line title "B" - TOKEN some line with another quotation marks
Line title "C" - TOKEN some line TOKEN more "text"
Random "line"

Preferably without external dependencies (e.g. Python, JS) on Linux, so sed, awk, or bash are probably best.

PS - What I've tried so far is:

sed -iE "s/TOKEN(.+)(\")(.+).*\TOKEN\1\3/g" /tmp/test

But it handles only a single replacement per line

EDIT: (sorry about late addition after many answers)

Upvotes: 2

Views: 148

Answers (5)

Ed Morton

Reputation: 204258

Assuming that:

  1. you want TOKEN treated as a regular expression (or if not will escape metachars in advance of using it),
  2. a line where TOKEN doesn't occur should be left unchanged, and
  3. TOKEN matches even if it's in the middle of another string

then using any awk in any shell on every Unix box:

$ awk '
match($0,/TOKEN.*TOKEN/) || match($0,/TOKEN.*/) {
    tgt = substr($0,RSTART,RLENGTH)
    gsub(/"/, "\\\"", tgt)
    $0 = substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
1' example.txt
Line title "A" - TOKEN some line with \"quotation marks\"
Line title "B" - TOKEN some line with \"another quotation marks\"
Line title "C" - TOKEN some \"line\" TOKEN more "text"
Random "line"
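
For the question's second option (deleting the quotes rather than escaping them), the same script should work with an empty replacement string. The sketch below wraps it in a shell function; the name strip_quotes_in_token_span is made up for illustration and is not part of the answer.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the answer):
# delete double quotes inside the TOKEN...TOKEN span, or from the only
# TOKEN to the end of the line when there is just one TOKEN.
strip_quotes_in_token_span() {
    awk '
    match($0, /TOKEN.*TOKEN/) || match($0, /TOKEN.*/) {
        tgt = substr($0, RSTART, RLENGTH)
        gsub(/"/, "", tgt)    # empty replacement deletes instead of escaping
        $0 = substr($0, 1, RSTART-1) tgt substr($0, RSTART+RLENGTH)
    }
    1' "$@"
}

# Reads stdin when no file name is given.
printf '%s\n' 'Line title "C" - TOKEN some "line" TOKEN more "text"' 'Random "line"' |
    strip_quotes_in_token_span
```

Called as strip_quotes_in_token_span example.txt, it reads the file instead of stdin.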

Upvotes: 3

Paul Hodges

Reputation: 15418

Divide and conquer in sed -

$: sed '/TOKEN/{ h; s/TOKEN.*//; x; s/^.*TOKEN//; s/"/\\"/g; H; x; s/\n/TOKEN/; }' file
# input (example.txt):
Line title "A" - TOKEN some line with \"quotation marks\"
Line title "B" - TOKEN some line with \"another quotation marks\"

Explanation:

/TOKEN/{ ...} - This acts only on lines with the TOKEN
h; - this places a copy of the line in the hold buffer
s/TOKEN.*//; - this removes from TOKEN through the end of the pattern buffer copy
x; - this exchanges the pattern and hold buffers, placing the abbreviated beginning in the hold buffer and the complete record in the pattern buffer
s/^.*TOKEN//; - eliminate the part you do NOT want changed from the pattern buffer copy
s/"/\\"/g; - backslash-quote the double-quotes characters remaining in the pattern buffer copy; use s/"//g to just remove them
H; - this appends a newline to the hold buffer, then adds the pattern buffer copy as another line in the hold buffer
x; - this switches the entire hold buffer back to the pattern buffer
s/\n/TOKEN/; - this replaces the newline with TOKEN

In English:

  1. make a copy.
  2. chop the part you plan to edit off the original.
  3. swap to save the original beginning.
  4. chop the beginning off the part you plan to edit.
  5. backslash or remove the quotes in the part you want edited.
  6. stack the edited end onto the original beginning.
  7. swap them back in and paste them back onto one line with the TOKEN between.
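
As a small check of the steps above, here is the removal variant (s/"//g in place of the escaping substitution) run on one sample line. This is only a sketch and it assumes a single TOKEN per line, as in lines A and B of the question; the variable names line and result are illustrative.

```shell
#!/bin/sh
# Assumption: exactly one TOKEN on the line. With two TOKENs the greedy
# s/^.*TOKEN// keeps only the text after the last one.
line='Line title "A" - TOKEN some line with "quotation marks"'

result=$(printf '%s\n' "$line" |
    sed '/TOKEN/{ h; s/TOKEN.*//; x; s/^.*TOKEN//; s/"//g; H; x; s/\n/TOKEN/; }')

printf '%s\n' "$result"
# prints: Line title "A" - TOKEN some line with quotation marks
```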

Upvotes: 1

Sundeep

Reputation: 23677

Here's another perl solution. It deletes every double quote that has no TOKEN after it on the same line. Use perl -i -pe for in-place modification.

$ perl -pe 's/"(?!.*TOKEN)//g' ip.txt
Line title "A" - TOKEN some line with quotation marks
Line title "B" - TOKEN some line with another quotation marks

If there can be lines with double quotes not containing TOKEN and such quotes shouldn't be changed, use perl -pe 's/"(?!.*TOKEN)//g if /TOKEN/'


Here's an awk solution. The input is split using TOKEN as the field delimiter and the substitution is performed on the second field. Lines not containing TOKEN won't be modified.

$ awk 'BEGIN{FS=OFS="TOKEN"} {gsub(/"/, "", $2)} 1' ip.txt
Line title "A" - TOKEN some line with quotation marks
Line title "B" - TOKEN some line with another quotation marks
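
The same field-splitting idea can presumably be used for the escaping variant as well. In the sketch below, the NF > 1 guard is a defensive addition (an assumption, not part of the answer) so that $2 is only touched on lines that actually contain TOKEN; the function name escape_second_field is also made up.

```shell
#!/bin/sh
# Hypothetical helper: escape double quotes in the text between the
# first and second TOKEN (or after the only TOKEN).
escape_second_field() {
    awk 'BEGIN { FS = OFS = "TOKEN" }
         NF > 1 { gsub(/"/, "\\\"", $2) }   # only lines that contain TOKEN
         1' "$@"
}

printf '%s\n' 'Line title "C" - TOKEN some "line" TOKEN more "text"' |
    escape_second_field
# prints: Line title "C" - TOKEN some \"line\" TOKEN more "text"
```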

Upvotes: 2

choroba

Reputation: 242038

Perl to the rescue!

perl -pe 's/TOKEN\K(.*)/$1 =~ s|"|\\"|gr/e' -- example.txt

It's a substitution inside a substitution.

The outer substitution looks like this:

s/TOKEN\K(.*)/.../e

Which replaces everything after TOKEN with the ... part. The /e means the ... part is evaluated as code.

The replacement code is $1 =~ s|"|\\"|gr. It substitutes all " with \" in the contents of $1, i.e. the part matched by the outer substitution, and returns the result (that's what the /r does).

To remove the double quotes instead of escaping, just delete the \\" part.

Upvotes: 2

Wiktor Stribiżew

Reputation: 627219

Removing all double quotation marks after TOKEN substring with sed can be done with

sed -i -E ':A; s/(TOKEN[^"]*)"/\1/g; tA' /tmp/test

Replacing " with \" after TOKEN is also possible:

sed -i -E ':A; s/(TOKEN[^\\"]*(\\.[^\\"]*)*)"/\1\\"/g; tA' /tmp/test

Details:

  • :A - sets a label A
  • s/(TOKEN[^"]*)"/\1/g - finds TOKEN followed by zero or more chars other than " (captured into Group 1) and then a ", and replaces the match with the Group 1 value, which drops that ". In the escaping version, [^\\"]*(\\.[^\\"]*)* matches any chars other than backslashes and double quotation marks together with any backslash-escaped chars, so quotes that are already escaped are skipped, and the \1\\" replacement puts back the Group 1 value plus an escaped ".
  • tA - goes back to label A upon successful replacement.

See the online demo:

#!/bin/bash
s='Line title "A" - TOKEN some line with "quotation marks"'
sed -E ':A; s/(TOKEN[^"]*)"/\1/g; tA' <<< "$s"
# => Line title "A" - TOKEN some line with quotation marks
sed -E ':A; s/(TOKEN[^\\"]*(\\.[^\\"]*)*)"/\1\\"/g; tA' <<< "$s"
# => Line title "A" - TOKEN some line with \"quotation marks\"

Upvotes: 3
