pravek

Reputation: 21

Delete nodes from xml file using sed or awk

I want to remove the "error_mail" and "succeed_mail" nodes from multiple similar XML files using sed or awk.

Using sed, I tried the command below, but it's not working:

sed -i /<action name="succeed_mail">/,/<\/action>/d *.xml

Sample input file (test.xml):

 <workflow>
    <action name="start"
    -----
    -----
       </action>
    
    <action name="error_mail">
            <email xmlns="uri:oozie:email-action:0.1">
              <to>[email protected]</to>
              <cc>[email protected]</cc>
              <subject>Batch Failed</subject>
              <body>Batch Failed at ${node}</body>
            </email>
            <ok to="killjob"/>
            <error to="killjob"/>
          </action>
        <action name="succeed_mail">
            <email xmlns="uri:oozie:email-action:0.1">
              <to>[email protected]</to>
              <cc>[email protected]</cc>
              <subject>Batch Succeed</subject>
              <body>Batch completed</body>
            </email>
            <ok to="end"/>
            <error to="end"/>
          </action></r>
    </workflow>

Desired output (test.xml):

<workflow>
<action name="start"
-----
-----
   </action>
</workflow>

Upvotes: 1

Views: 312

Answers (3)

Richard

Reputation: 1

Had a similar need. My process:

  1. convert xml to a single line.
  2. move everything from <tag> to </tag> onto a line of its own
  3. grep -v tag (or string as desired )
  4. xmllint --format
  5. qed

This method is quite generic. To convert the XML to a single line, use tr -d '\n'. The csh script below handles step 2; it reads the XML from piped stdin:

>cat xmlsinglenewline
#!/bin/csh -f
# $1 is the tag
# Usage: <command>  "tag"
sed "s/<$1/\n\<$1/g" | sed "s/<\/$1>/\<\/$1\>\n/g"

Caveat: this cannot handle an element nested inside another element with the same tag.
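
The steps above can be sketched as a single sh pipeline. This is only an illustration under assumptions: GNU sed (so that \n in the replacement means a newline), a tag name without regex metacharacters, and a hypothetical function name drop_elements that is not part of the original answer. The second argument is an ERE matched against each isolated element.

```shell
#!/bin/sh
# drop_elements TAG ERE  (hypothetical helper, not from the original answer)
# Reads XML on stdin and writes it to stdout without any <TAG ...>...</TAG>
# element whose text matches ERE. Assumes GNU sed and non-nested TAG elements.
# Note: "<TAG" is matched as a plain prefix, so it also matches longer tag
# names that start with TAG.
drop_elements() {
  tag=$1
  pattern=$2
  tr -d '\n' |                                          # step 1: join into one line
    sed "s/<$tag/\n<$tag/g; s/<\/$tag>/<\/$tag>\n/g" |  # step 2: one element per line
    grep -Ev -- "$pattern"                              # step 3: drop matching lines
}
```

Possible usage: drop_elements action 'name="(error|succeed)_mail"' < test.xml > test.new.xml. For step 4, the output could be piped through xmllint --format - as long as the document is well-formed.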

Upvotes: 0

Ed Morton

Reputation: 203189

You didn't tell us in what way "it's not working" so I'm assuming you either don't know how to use | in a regexp or don't know you have to quote your scripts.

With a sed that has -E to enable EREs:

$ sed -E '/<action name="(succeed|error)_mail">/,/<\/action>/d' file
 <workflow>
    <action name="start"
    -----
    -----
       </action>

    </workflow>

or with any awk:

$ awk '/<action name="(succeed|error)_mail">/{f=1} !f; /<\/action>/{f=0}' file
 <workflow>
    <action name="start"
    -----
    -----
       </action>

    </workflow>

That is, of course, fragile and will fail for various other layouts of the same XML, which is why using XML-aware tools is always advised.
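
As an illustration of the XML-aware route, here is a minimal sketch using xmlstarlet. It assumes xmlstarlet is installed under that name (some distributions ship it as xml), that the files are well-formed (the placeholder lines in the question's sample would have to be real XML), and that the action elements are in no namespace, as in the sample; files that declare a default namespace on the root element would need a namespace binding via -N. The function name delete_mail_actions and the .bak suffix are hypothetical choices.

```shell
#!/bin/sh
# delete_mail_actions FILE...  (hypothetical helper)
# Deletes every <action> whose name attribute is error_mail or succeed_mail.
# -L edits the file in place, so a backup copy is made first.
delete_mail_actions() {
  for f in "$@"; do
    cp -p -- "$f" "$f.bak" || return 1
    xmlstarlet ed -L \
      -d '//action[@name="error_mail" or @name="succeed_mail"]' \
      "$f" || return 1
  done
}
```

Possible usage: delete_mail_actions *.xml. Because the match is an XPath expression on the parsed document, it should not depend on how the elements are laid out across lines.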

Upvotes: 0

RavinderSingh13

Reputation: 133428

Experts always advise using tools like xmlstarlet to parse XML files, but since the OP is using sed, here is an awk solution. Fair warning: this is written for the shown samples ONLY, so it may not work if your input differs.

awk '
/^ +<\/action>/ && foundSuccess{
  foundSuccess=""
  next
}
/^ +<\/action>/ && foundError{
  foundError=""
  next
}
/^ +<action name="error_mail">$/{
  foundError=1
}
/^ +<action name="succeed_mail">/{
  foundSuccess=1
}
NF && !foundError && !foundSuccess
' Input_file

Explanation: a detailed explanation of the above.

awk '                              ##Start the awk program.
/^ +<\/action>/ && foundSuccess{   ##If the line has </action> and foundSuccess is set, do the following.
  foundSuccess=""                  ##Clear foundSuccess.
  next                             ##Skip all further rules for this line.
}
/^ +<\/action>/ && foundError{     ##If the line has </action> and foundError is set, do the following.
  foundError=""                    ##Clear foundError.
  next                             ##Skip all further rules for this line.
}
/^ +<action name="error_mail">$/{  ##If the line starts with spaces and then has <action name="error_mail">:
  foundError=1                     ##Set foundError to 1.
}
/^ +<action name="succeed_mail">/{ ##If the line starts with spaces and then has <action name="succeed_mail">:
  foundSuccess=1                   ##Set foundSuccess to 1.
}
NF && !foundError && !foundSuccess ##Print the line if it is NOT empty and neither foundError nor foundSuccess is set.
' Input_file                       ##Name of the input file.

NOTE: To process multiple XML files, pass *.xml in place of Input_file, but this will not save the changes in place. For in-place editing, use GNU awk and change awk to awk -i inplace in the above code. It is safer to test on a few files first and only then run with the inplace option. See this link for how to do in-place editing with awk while keeping a backup of Input_file: https://stackoverflow.com/a/16529730/5866580
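
If GNU awk is not available, a portable alternative to -i inplace is to write to a temporary file and copy it back. The sketch below assumes an sh with mktemp available and uses a condensed single-flag version of the awk logic above; the function name strip_mail_actions and the .bak suffix are illustrative choices, not part of the original answer.

```shell
#!/bin/sh
# strip_mail_actions FILE...  (hypothetical helper)
# Rewrites each file without the error_mail/succeed_mail actions and keeps
# the original next to it as FILE.bak.
strip_mail_actions() {
  for f in "$@"; do
    tmp=$(mktemp) || return 1
    cp -p -- "$f" "$f.bak" &&
      awk '
        /^ +<action name="(error|succeed)_mail">/ { skip = 1 }  # entering an unwanted action
        NF && !skip                                             # print non-empty lines outside it
        /^ +<\/action>/ { skip = 0 }                            # closing tag ends the skipped range
      ' "$f" > "$tmp" &&
      cat "$tmp" > "$f"      # overwrite contents, keeping the file mode
    status=$?
    rm -f "$tmp"
    [ "$status" -eq 0 ] || return "$status"
  done
}
```

Possible usage: strip_mail_actions *.xml, after trying it on a copy of one or two files. Like the awk answer, it relies on the indentation and line layout of the shown sample.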

Upvotes: 0
