MichaelW

Reputation: 1464

Using sed to extract element content of an XML file

I'm trying to use sed to extract everything between <Transport_key> and </Transport_key> from input files like this:

<?xml version="1.0" encoding="utf-8"?>
<Envelope xmlns:xenc="http://www.w3.org/2001/04/xmlenc#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<Header>
<Security>
<Transport_key>
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>
</Transport_key>
</Security>
</Header>
<Body>
</Body>
</Envelope>

So I want to get

<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>

regardless of any optional newlines between elements. I just want the text between the two strings unmodified, even if the input is a single big line.

I tried with

sed -e "s@.*<Transport_key>\(.*\)</Transport_key>.*@\1@" test.txt

but I have since learned that sed processes its input line by line, so this cannot work.

Is there a solution for that?

Upvotes: 1

Views: 3764

Answers (4)

Ed Morton

Reputation: 203209

The simplest solution to this particular problem that's independent of white space is to use GNU awk for multi-char RS:

$ gawk -v RS='\\s*</?Transport_key>\\s*' 'NR==2' file
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>

$ tr -d '\n' < file
<?xml version="1.0" encoding="utf-8"?><Envelope xmlns:xenc="http://www.w3.org/2001/04/xmlenc#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><Header><Security><Transport_key><EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#"><EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" /><CipherData><CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue></CipherData></EncryptedKey></Transport_key></Security></Header><Body></Body></Envelope>

$ tr -d '\n' < file | gawk -v RS='\\s*</?Transport_key>\\s*' 'NR==2'
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#"><EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" /><CipherData><CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue></CipherData></EncryptedKey>

The reason to use an XML parser, though, is to handle things like the tag value showing up inside a string, etc. properly.

Upvotes: 0

souser

Reputation: 6110

With sed, you can try the following:

sed -n '/<Transport_key>/,/<\/Transport_key>/p' test1.xml | sed -e '/Transport_key/d'

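The first command prints everything from the line containing <Transport_key> through the line containing </Transport_key>. Since this also prints the tags themselves, the second command deletes the lines containing Transport_key. Note that this relies on each tag sitting on a line of its own; it does not work if the input is a single long line.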
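For input that may arrive as one long line, one sed-only possibility is to collect the whole file in the hold space and run the substitution from the question once over the complete text. The following is a minimal sketch, not a robust XML solution: the sample string and the function name extract_transport_key are made up for illustration, and the greedy .* assumes the input contains exactly one Transport_key element.

```shell
# Hypothetical sample: the whole document on a single line.
xml='<Security><Transport_key><EncryptedKey Id="TK"><CipherValue>abc</CipherValue></EncryptedKey></Transport_key></Security>'

# H appends every input line to the hold space. On the last line ($) the
# hold space is swapped into the pattern space (x), and the substitution
# then sees the complete input, embedded newlines included.
extract_transport_key() {
  sed -n 'H;${x;s@.*<Transport_key>[[:space:]]*\(.*\)</Transport_key>.*@\1@p;}'
}

result=$(printf '%s\n' "$xml" | extract_transport_key)
printf '%s\n' "$result"
```

The same function can be fed a multi-line file, e.g. extract_transport_key < test.txt; whitespace directly after the opening tag is dropped by [[:space:]]*, while a newline before the closing tag is kept as part of the captured text.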

Upvotes: 0

RomanPerekhrest

Reputation: 92854

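For your "last try without such ...", a grep approach (GNU grep; -P enables Perl-compatible regexes, and -z makes grep read NUL-separated records, so a file without NUL bytes is matched as a single record):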

grep -Poz '<Transport_key>\s*\K[\s\S]*(?=</Transport_key>)' test.txt

The output:

<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>

For a proper solution going forward, the xmlstarlet approach:

xmlstarlet sel -t -c '//Transport_key/*' -n test.txt

Upvotes: 2

Nahuel Fouilleul

Reputation: 19305

It would be safer to use an XML parser, but in some cases it can also be done with a regex.

perl -0777 -ne 'print for m@<EncryptedKey(?:(?!</EncryptedKey).)*</EncryptedKey>@gs' <test.txt

From perl -h:

  • -0777 : specify the record separator (octal; 777 means undef, i.e. read the whole file at once)
  • -n : assume "while (<>) { ... }" loop around program

modifiers

  • g: all matches
  • s: . matches \n

regex:

  • (?!..): negative look-ahead

Upvotes: 0
