Reputation: 1464
Using sed, I'm trying to extract everything between <Transport_key> and </Transport_key> from input files like this:
<?xml version="1.0" encoding="utf-8"?>
<Envelope xmlns:xenc="http://www.w3.org/2001/04/xmlenc#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<Header>
<Security>
<Transport_key>
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>
</Transport_key>
</Security>
</Header>
<Body>
</Body>
</Envelope>
So I want to get:
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>
regardless of any optional newlines between elements. I just want the text between the two strings unmodified, even if the input is a single big line.
I tried with
sed -e "s@.*<Transport_key>\(.*\)</Transport_key>.*@\1@" test.txt
but I have since learned that sed processes its input line by line, so this cannot work.
Is there a solution for that?
Upvotes: 1
Views: 3764
Reputation: 203209
The simplest solution to this particular problem that's independent of white space is to use GNU awk for multi-char RS:
$ gawk -v RS='\\s*</?Transport_key>\\s*' 'NR==2' file
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>
$ tr -d '\n' < file
<?xml version="1.0" encoding="utf-8"?><Envelope xmlns:xenc="http://www.w3.org/2001/04/xmlenc#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><Header><Security><Transport_key><EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#"><EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" /><CipherData><CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue></CipherData></EncryptedKey></Transport_key></Security></Header><Body></Body></Envelope>
$ tr -d '\n' < file | gawk -v RS='\\s*</?Transport_key>\\s*' 'NR==2'
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#"><EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" /><CipherData><CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue></CipherData></EncryptedKey>
The reason to use an XML parser, though, is that it properly handles cases such as the tag text showing up inside a string.
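To illustrate that caveat, here is a small sketch, assuming a made-up input in which the tag text also appears inside a CDATA section. The file contents and the helper name count_opening_tags are hypothetical, not taken from the question. A text-based tool counts two opening tags even though the document contains only one Transport_key element:

```shell
# Hypothetical input: the literal text "<Transport_key>" appears once inside
# CDATA (plain character data) and once as a real element.
tricky=$(mktemp)
printf '%s\n' \
    '<Security>' \
    '<Note><![CDATA[literal <Transport_key> text, not an element]]></Note>' \
    '<Transport_key><EncryptedKey/></Transport_key>' \
    '</Security>' > "$tricky"

# count_opening_tags is an illustrative helper name.
count_opening_tags() {
    # grep -o prints every textual match on its own line; wc -l counts them.
    # tr strips the padding that some wc implementations add.
    grep -o '<Transport_key>' "$1" | wc -l | tr -d ' '
}

count_opening_tags "$tricky"
```

Any regex or record-separator approach keyed on the tag text would be confused by such input in the same way, whereas an XML parser treats the CDATA content as plain text.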
Upvotes: 0
Reputation: 6110
With sed, you can try the following:
sed -n '/<Transport_key>/,/<\/Transport_key>/p' test1.xml | sed -e '/Transport_key/d'
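The first command prints everything between the Transport_key tags. Since this also prints the lines holding the tags themselves, the second command deletes every line containing Transport_key. This works only when each Transport_key tag sits on its own line, because sed selects and deletes whole lines.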
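If the input may be one big line, a possible variant is to let sed collect the whole file into its pattern space first and then apply the substitution from the question. This is only a sketch: the sample data and the helper name extract_transport_key are made up for illustration, and it keeps any newlines that sit directly inside the tags.

```shell
# Hypothetical sample input: the same kind of structure, squeezed onto one line.
sample=$(mktemp)
printf '%s\n' '<Security><Transport_key><EncryptedKey Id="TK"><CipherData/></EncryptedKey></Transport_key></Security>' > "$sample"

# extract_transport_key is an illustrative helper name.
extract_transport_key() {
    # :a, $!N and $!ba keep appending the next line until the last one is
    # reached, so the substitution sees the whole file instead of one line.
    # -n plus the p flag prints only when the substitution matched.
    sed -n -e ':a' -e '$!N' -e '$!ba' \
        -e 's@.*<Transport_key>\(.*\)</Transport_key>.*@\1@p' "$1"
}

extract_transport_key "$sample"
```

Because the whole file is held in memory, this is best suited to small inputs like the one in the question.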
Upvotes: 0
Reputation: 92854
For your "last try without such ...", grep approach:
grep -Poz '<Transport_key>\s*\K[\s\S]*(?=</Transport_key>)' test.txt
The output:
<EncryptedKey Id="TK" xmlns="http://www.w3.org/2001/04/xmlenc#">
<EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p" />
<CipherData>
<CipherValue>pifKajuAK8FKwqLEhKIP4x5V5XUQyrwhpA</CipherValue>
</CipherData>
</EncryptedKey>
For a more robust approach, use xmlstarlet:
xmlstarlet sel -t -c '//Transport_key/*' -n test.txt
Upvotes: 2
Reputation: 19305
It would be safer to use an XML parser, but in some cases it can also be done with a regex.
perl -0777 -ne 'print for m@<EncryptedKey(?:(?!</EncryptedKey>).)*</EncryptedKey>@gs' <test.txt
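Options, from perl -h: -0777 reads the whole file as one record, and -n loops over the input without printing it.
Regex modifiers: g finds all matches, and s makes . match \n.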
Upvotes: 0