hommel

Reputation: 55

Replace text between pattern range on same line

This may be a better task for awk than sed, but the goal is to parse a single, long string (it happens to be an XML doc) and replace text within a pattern range with another character.

I want to preserve the number of characters being replaced and simply mask them with asterisks. I've put something together in a Python script to parse the XML tree, but I have a feeling a native program is going to be much faster.

Assuming the string: "<mask>123</mask><keep>123</keep>"

...I'd like the output: "<mask>***</mask><keep>123</keep>"

My first attempt with sed without using ranges got me this:

$ echo "<mask>123</mask><keep>123</keep>" | sed "s/[0-9]/*/g"
<mask>***</mask><keep>***</keep>

I learned that sed can operate within ranges, but my understanding is that the behavior can only be toggled from line-to-line, not over the course of processing a single line.

Experimenting with pattern ranges got me the following (consistent with my understanding) and thus didn't work either:

$ echo "<mask>123</mask><keep>123</keep>" | sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g" 
<mask>***</mask><keep>***</keep>

EDIT: In fact, even with line breaks in the input I don't get the result I expected, so I must not be understanding the pattern range behavior correctly (or my example is poorly constructed):

$ printf '<mask>123</mask>\n<keep>123</keep>\n' | sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g"
<mask>***</mask>
<keep>***</keep>

Any tips would be greatly appreciated.

Upvotes: 3

Views: 464

Answers (2)

Ed Morton

Reputation: 204005

Never use range expressions. They make simple tasks very slightly briefer, but they then need a complete rewrite or duplicated conditions as soon as your requirements become marginally more interesting. If a range is necessary, always use a flag variable instead. What that means, of course, is that you can't take this approach in sed, since it doesn't support variables.
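As a rough illustration of the flag-variable idiom (a sketch, not part of the original answer; the function name `mask_with_flag` and the variable `f` are made up for this example), a line-oriented counterpart of the sed range in POSIX awk might look like this:

```shell
# Hypothetical helper: mask digits on every line from one containing <mask>
# through one containing </mask>, using a flag instead of a range expression.
mask_with_flag() {
  awk '
    /<mask>/   { f = 1 }                 # start tag seen: turn masking on
    f          { gsub(/[0-9]/, "*") }    # mask digits while the flag is set
    /<\/mask>/ { f = 0 }                 # end tag seen: turn masking off
    { print }
  '
}

printf '<mask>234</mask>\n123\n' | mask_with_flag
```

Here the end condition is also tested on the line that set the flag, so a `</mask>` on the same line closes the region and the following `123` line is left alone. The flag still applies to whole lines, so on its own it does not restrict masking to part of a line.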

Anyway, here's a trivial GNU awk (for multi-char RS and RT) solution that doesn't directly use ranges at all:

$ cat file
Assuming the string: "<mask>123</mask><keep>123</keep>" ...I'd like the

$ awk -v RS='</mask>' -v ORS= '{print gensub(/(.*<mask>).*/,"\\1***",1) RT}' file
Assuming the string: "<mask>***</mask><keep>123</keep>" ...I'd like the

Or, if you need the number of *s to match the number of characters they're replacing:

$ cat file
Assuming  first string: "<mask>123</mask><keep>123</keep>" ...I'd like the
Assuming second string: "<mask>1234567</mask><keep>123</keep>" ...I'd like the

$ awk -v RS='</mask>' 'match($0,/(.*<mask>)(.*)/,a){ $0=a[1] gensub(/./,"*","g",a[2]) } {ORS=RT} 1' file
Assuming  first string: "<mask>***</mask><keep>123</keep>" ...I'd like the
Assuming second string: "<mask>*******</mask><keep>123</keep>" ...I'd like the
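If GNU awk is not available, the same length-preserving masking can probably be done in POSIX awk with `index()` and `substr()`. The following is only a sketch under assumptions: the function name `mask_inline` is hypothetical, each `<mask>...</mask>` pair is assumed to start and end on the same line, and the tags are assumed to carry no attributes.

```shell
# Hypothetical helper: replace every character between <mask> and </mask>
# with an asterisk, handling several pairs per line.
mask_inline() {
  awk '
  {
    out  = ""
    rest = $0
    # Walk the line, one <mask>...</mask> pair at a time.
    while ((s = index(rest, "<mask>")) > 0) {
      head = substr(rest, 1, s + 5)     # text up to and including <mask>
      tail = substr(rest, s + 6)        # text after <mask>
      e = index(tail, "</mask>")
      if (e == 0) {                     # no closing tag on this line: stop
        out  = out head
        rest = tail
        break
      }
      body = substr(tail, 1, e - 1)
      gsub(/./, "*", body)              # one asterisk per masked character
      out  = out head body "</mask>"
      rest = substr(tail, e + 7)        # continue after </mask>
    }
    print out rest
  }'
}

printf '%s\n' '<mask>123</mask><keep>123</keep>' | mask_inline
```

Text after an unclosed `<mask>` is passed through unchanged in this sketch; whether that is the right behavior depends on the input.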

Upvotes: 3

Kent

Reputation: 195179

The output you got is completely correct. It comes from a quirk of how sed handles a range address made of two regexes.

What you gave sed is /regex1/,/regex2/. sed first looks for a line that matches address 1, which is /regex1/; the first line matches, fine. Your address 2 is a regex too, so this rule applies:

and if addr2 is a regexp, it will not be tested against the line that addr1 matched.

This sentence is from sed's man page.

That is, sed starts checking your /regex2/ from line 2. Of course, no later line matches /<\/mask>/, so the range never closes and sed applies the substitution to the rest of the file.

Check this example:

kent$  cat f
<mask>234</mask>
123
123
123
<mask>234</mask>
123
123
<keep>234</keep>

kent$  sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g" f
<mask>***</mask>
***
***
***
<mask>***</mask>
123
123
<keep>234</keep>
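Because range addresses select whole lines, they cannot limit a substitution to part of a line. One sed-only way to get the masking the question asks for is a substitution loop that turns one character into `*` per pass. This is a sketch that is not taken from either answer: the function name `mask_inline_sed` is hypothetical, and it assumes the masked text contains no literal `*` or `<`.

```shell
# Hypothetical helper: each pass matches <mask>, any asterisks already
# written, and the next unmasked character, then replaces that character
# with an asterisk. "ta" jumps back to label "a" while a substitution
# succeeded, so the loop ends once every masked region is all asterisks.
mask_inline_sed() {
  sed -e ':a' -e 's/\(<mask>\**\)[^*<]/\1*/' -e 'ta'
}

printf '%s\n' '<mask>123</mask><keep>123</keep>' | mask_inline_sed
```

The loop runs once per masked character, so very long masked regions may make it noticeably slower than a single-pass tool.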

Finally, just a suggestion: don't process XML with regex tools (sed/awk/grep...). Of course, you may just be using the "xml" as an example.

Upvotes: 2
