Dj Destram N

Reputation: 25

sed edit, delete xml tags

I'm a newbie with the great editor called sed.

I want to delete all the XML tags and extract the string between one specific tag, reportBody.

Here is how it looks on a single line:

<?xml version="1.0" ?><SOAP- ENV:Envelope xmlns:SOAP-ENV="blablah"><SOAP-ENV:Body> <getReportResponsexmlns:msgns="blahblahblah" xmlns="blahblah"><returnxmlns=""> <returnCode><majorReturnCode>000</majorReturnCode><minorReturnCode>0000</minorReturnCode><returnCode><reportName>blahblah</reportName><reportTitle>blahblahblahr</reportTitle><reportBody>STRING TO EXTRACT</reportBody><reportMimeType>text/csv</reportMimeType></return></getReportResponse></SOAP-ENV:Body></SOAP-ENV:Envelope>

The problem is that the XML file can vary: sometimes it is written on a single line, sometimes across 2-3 lines, and the string to extract may span more than one line inside the reportBody tag. So it can look something like this, or be different again:

    <?xml version="1.0" ?><SOAP- ENV:Envelope xmlns:SOAP-ENV="blablah"><SOAP-ENV:Body> 
<getReportResponsexmlns:msgns="blahblahblah" xmlns="blahblah">
<returnxmlns=""> <returnCode>
<majorReturnCode>000</majorReturnCode><minorReturnCode>0000</minorReturnCode>
<returnCode>
<reportName>blahblah</reportName><reportTitle>blahblahblahr</reportTitle><reportBody>
STRING 
TO 
EXTRACT</reportBody>
<reportMimeType>text/csv</reportMimeType></return>
</getReportResponse></SOAP-ENV:Body></SOAP-ENV:Envelope>

What is a solution that deals with all of these possible variations? Also, can I set parameters to save the output to a file and decode the string to base64? Thanks!

Upvotes: 1

Views: 530

Answers (1)

anubhava

Reputation: 785128

You can use this gnu-awk command to extract it (the output shown is for the multi-line input):

awk -v RS='<reportBody>.*</reportBody>' 'RT{print RT}' file.xml
<reportBody>
STRING
TO
EXTRACT</reportBody>

With the first (single-line) input you will get this output:

<reportBody>STRING TO EXTRACT</reportBody>

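-v RS='<reportBody>.*</reportBody>' sets the input record separator to a regex matching any text from <reportBody> to </reportBody>. In gnu-awk the text that actually matched RS is available in the RT variable, and . also matches newlines, so the same command works whether the body is on one line or several. Since .* is greedy, this assumes there is only one reportBody element in the file.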

To extract only the string inside the tags, use:

awk -v RS='<reportBody>.*</reportBody>' 'RT{
     gsub(/<\/?reportBody>[[:space:]]*/, "", RT); print RT}' file.xml

Upvotes: 1
