Reputation: 55
This may be a better task for awk than sed, but the goal is to parse a single, long string (it happens to be an XML doc) and replace text within a pattern range with another character.
I want to preserve the number of characters being replaced and simply mask them with asterisks. I've put together a Python script that parses the XML tree, but I have a feeling a native program is going to be much faster.
Assuming the string: "<mask>123</mask><keep>123</keep>"
...I'd like the output: "<mask>***</mask><keep>123</keep>"
My first attempt with sed, without using ranges, got me this:
$ echo "<mask>123</mask><keep>123</keep>" | sed "s/[0-9]/*/g"
<mask>***</mask><keep>***</keep>
I learned that sed can operate within ranges, but my understanding is that the behavior can only be toggled from line to line, not over the course of processing a single line.
Experimenting with pattern ranges got me the following (consistent with my understanding) and thus didn't work either:
$ echo "<mask>123</mask><keep>123</keep>" | sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g"
<mask>***</mask><keep>***</keep>
EDIT: In fact, even with line breaks in the input I get the same result, so I must be misunderstanding the pattern range behavior (or my example is poorly constructed):
$ printf '<mask>123</mask>\n<keep>123</keep>\n' | sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g"
<mask>***</mask>
<keep>***</keep>
Any tips would be greatly appreciated.
Upvotes: 3
Views: 464
Reputation: 204005
Never use range expressions. They make simple tasks very slightly briefer, but then need a complete rewrite or duplicated conditions as soon as your requirements become marginally more interesting. If a range is necessary, always use a flag variable instead. What that means, of course, is that you can't use sed for problems like this, since it doesn't support variables.
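To make the flag-variable idea concrete, here is a minimal sketch in POSIX awk. The function name `mask_flag`, the variable name `inside`, and the sample input are made up for illustration. It still works line by line, so it does not solve the single-line case from the question; it only shows a flag standing in for a range expression.

```shell
# mask_flag (hypothetical name): replace digits with * on every line from one
# containing <mask> through one containing </mask>, inclusive.
mask_flag() {
  awk '
    /<mask>/   { inside = 1 }            # start tag seen: set the flag
    inside     { gsub(/[0-9]/, "*") }    # mask digits while the flag is set
    /<\/mask>/ { inside = 0 }            # end tag seen, possibly on the same line: clear the flag
    { print }
  '
}

# Made-up sample input with one same-line pair and one multi-line pair.
printf '%s\n' '<mask>234</mask>' '123' '<mask>' '456' '</mask>' '<keep>789</keep>' | mask_flag
```

Because the end condition is an ordinary rule, it is tested on the same line that set the flag, so a line holding both tags does not leave the flag set for the lines that follow.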
Anyway, here's a trivial GNU awk (for multi-char RS and RT) solution that doesn't directly use ranges at all:
$ cat file
Assuming the string: "<mask>123</mask><keep>123</keep>" ...I'd like the
$ awk -v RS='</mask>' -v ORS= '{print gensub(/(.*<mask>).*/,"\\1***",1) RT}' file
Assuming the string: "<mask>***</mask><keep>123</keep>" ...I'd like the
or if you need the number of *s to match the number of characters they're replacing:
$ cat file
Assuming first string: "<mask>123</mask><keep>123</keep>" ...I'd like the
Assuming second string: "<mask>1234567</mask><keep>123</keep>" ...I'd like the
$ awk -v RS='</mask>' 'match($0,/(.*<mask>)(.*)/,a){ $0=a[1] gensub(/./,"*","g",a[2]) } {ORS=RT} 1' file
Assuming first string: "<mask>***</mask><keep>123</keep>" ...I'd like the
Assuming second string: "<mask>*******</mask><keep>123</keep>" ...I'd like the
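If GNU awk isn't available, the same length-preserving masking can be sketched in POSIX awk with index() and substr(). This is a different technique from the RS/RT one above, and it assumes each <mask>...</mask> pair sits on a single line, is not nested, and carries no attributes. The function name `mask_inline` is made up for illustration.

```shell
# mask_inline (hypothetical name): within each line, replace every character
# between <mask> and </mask> with *, keeping the original length.
mask_inline() {
  awk '
  {
    out = ""; rest = $0
    while ((s = index(rest, "<mask>")) > 0) {
      out  = out substr(rest, 1, s + 5)   # copy text up to and including <mask> (6 chars)
      rest = substr(rest, s + 6)
      e = index(rest, "</mask>")
      if (e == 0) break                   # no closing tag on this line: leave the rest as-is
      body = substr(rest, 1, e - 1)
      gsub(/./, "*", body)                # one * per character
      out  = out body "</mask>"
      rest = substr(rest, e + 7)          # skip past </mask> (7 chars)
    }
    print out rest
  }'
}

echo '<mask>123</mask><keep>123</keep>' | mask_inline
```

With the string from the question this prints `<mask>***</mask><keep>123</keep>`.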
Upvotes: 3
Reputation: 195179
The output you got is completely correct. It comes from how sed handles a range address made of two regexes.
What you gave sed is /regex1/,/regex2/. sed first looks for a line that matches address 1, which is /regex1/, and the first line matched, fine. Your address 2 is a regex too, so this rule applies:
and if addr2 is a regexp, it will not be tested against the line that addr1 matched.
This sentence is from sed's man page.
That is, sed only starts checking your /regex2/ from line 2. No later line matches /<\/mask>/, so the range never closes and sed applied the substitution to the whole file.
Check this example:
kent$ cat f
<mask>234</mask>
123
123
123
<mask>234</mask>
123
123
<keep>234</keep>
kent$ sed "/<mask>/,/<\/mask>/ s/[0-9]/*/g" f
<mask>***</mask>
***
***
***
<mask>***</mask>
123
123
<keep>234</keep>
Finally, just a suggestion: don't process XML with regex tools (sed/awk/grep...). Of course, you may only be using the "xml" as an example.
Upvotes: 2