Reputation: 9796
I have a file where each line is a base64-encoded XML document. The decoded XML documents may contain new line characters. I would like to grep out each XML document containing a given word.
The problem is that, when I decode the lines of the file, I get multiple lines for each base64-encoded line and I cannot grep it any more. I need something like base64 decode + remove line breaks in one step.
How can I achieve that in the Linux shell? I have Python, Perl and awk available.
>cat fileContainingBase64EncodedXMLsInEachLine.txt | what should I write here?
PGZvbz4NCjxiYXIvPg0KPC9mb28+
PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==
PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+
Let's say I want the XML documents containing 'bar'
<foo>
<bar/>
</foo>
<foo>
<barometer/>
</foo>
>cat fileContainingBase64EncodedXMLsInEachLine.txt | base64 --decode | grep bar
Delivers:
<bar/>
<barometer/>
So I do not get the full XML documents containing bar and barometer.
Upvotes: 1
Views: 4238
Reputation: 3787
Perl to the rescue:
perl -MMIME::Base64 -nE '$_=decode_base64($_);/bar/&&say' fileContaining...txt
or
cat fileContaining...txt | perl -MMIME::Base64 -nE'$_=decode_base64($_);/bar/&&say'
Upvotes: 1
Reputation: 26551
Update: if you know that the first node name is <foo>, then you can just do:
$ echo "<head>$(base64 --decode < file)</head>" | \
    xmlstarlet sel -t -m '//bar/ancestor::foo' -c .
It selects the ancestor named foo of the node called bar. Since foo is the first XML node, it will select the requested XML document.
original answer below:
Using xmlstarlet, you might want to do this:
$ echo "<head>$(base64 --decode < file)</head>" | \
    xmlstarlet sel -t -m '//bar/ancestor::*[last()-1]' -c .
This essentially selects the full XML tree of ancestors of the node 'bar', but it will only go up to the correct depth.
I added an extra head node to make the full string a valid XML file. This way you only need to print from the first node onwards.
The echo would produce something like (slightly different version):
<head>
<foo />
<foo>
<barometer />
</foo>
<foo>
<DDD>
<BBB/>
<bar />
</DDD>
</foo>
</head>
xmlstarlet will do a template selection based on the XPath //bar/ancestor::*, leading to the following set of matches (the ancestor axis does not include the bar node itself):
<DDD><BBB /><bar /></DDD>
<foo><DDD><BBB /><bar /></DDD></foo>
<head> everything </ head>
We are interested in the penultimate one, i.e. [last()-1], and we ask to print a copy of it with -c .
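For comparison, the same structure-aware idea (match an element named bar, not the substring) can be sketched with Python's standard library instead of xmlstarlet. This is only a sketch under assumptions: the function name docs_with_element is made up for illustration, each line is assumed to decode to one well-formed UTF-8 XML document, and the documents are assumed not to use XML namespaces.

```python
import base64
import xml.etree.ElementTree as ET


def docs_with_element(lines, tag):
    """Yield decoded XML documents that contain an element named `tag`.

    Hypothetical helper: assumes one base64-encoded, well-formed XML
    document per line and no XML namespaces.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        raw = base64.b64decode(line)
        root = ET.fromstring(raw)  # parse from bytes
        # iter() walks the root and all of its descendants
        if next(root.iter(tag), None) is not None:
            yield raw.decode("utf-8")


if __name__ == "__main__":
    # Sample lines taken from the question.
    sample = [
        "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
        "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
        "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
    ]
    for doc in docs_with_element(sample, "bar"):
        print(doc)
```

Unlike a plain substring search, this does not select the document containing <barometer/>, because the element name has to match exactly.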
Upvotes: 1
Reputation: 55499
Here's some Python code that accepts a filename followed by the search word on the command line. As usual, if either arg contains spaces, it must be quoted.
import sys
from base64 import b64decode

fname, pattern = sys.argv[1:]
with open(fname) as f:
    for row in f:
        row = b64decode(row).decode()
        if pattern in row:
            print(row, end='\n\n')
Running this on your data with "bar" as the pattern arg gives:
<foo>
<bar/>
</foo>
<foo>
<barometer/>
</foo>
In order to practice my rather rusty awk skills, I decided to write an awk command line to do this. It uses the standard base64 command to do the decoding. Note that it needs gawk, since the |& coprocess operator is a gawk extension.
awk 'BEGIN{cmd="base64 -d"}; {print |& cmd; close(cmd,"to"); z=""; while(cmd |& getline s) z=z s "\n"; close(cmd); if (z~pat)print z}' pat='bar' testdata_b64.txt
You pass it the pattern using the pat argument, which can be a regex. You can send data to it via standard input, or you can give it one or more filenames on the command line.
Note that regex patterns need double escaping, e.g. pat='\\<bar\\>' matches the word bar.
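The regex variant can also be sketched in Python, where a raw string avoids the double escaping. This is a rough equivalent rather than a translation of the awk command: grep_decoded is a name invented here, each line is assumed to decode to UTF-8 text, and \b is Python's word-boundary syntax (standing in for \< and \>).

```python
import base64
import re


def grep_decoded(lines, pattern):
    """Yield decoded documents whose text matches the regex `pattern`.

    Hypothetical helper: assumes one base64-encoded UTF-8 document per line.
    """
    regex = re.compile(pattern)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        doc = base64.b64decode(line).decode("utf-8")
        # search() scans the whole multi-line document, not just one line
        if regex.search(doc):
            yield doc


if __name__ == "__main__":
    # Sample lines taken from the question.
    sample = [
        "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
        "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
        "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
    ]
    # Raw string: single backslashes are enough here.
    for doc in grep_decoded(sample, r"\bbar\b"):
        print(doc, end="\n\n")
```

With the pattern r"\bbar\b" only the document containing <bar/> is selected; with the plain pattern "bar" the barometer document matches as well.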
Upvotes: 3
Reputation: 1163
You can use tr inside a loop to remove all new lines for each of the XML documents like this:
#!/bin/bash
while IFS='' read -r line
do
    echo -n "$line" | base64 --decode | tr -d '\r\n'
    echo
done < fileContainingBase64EncodedXMLsInEachLine.txt
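Since Python is available as well, the same "decode, then strip line breaks" step could be sketched as below, which avoids starting base64 and tr once per line. The function name flatten_decoded_lines is made up for this example, and the decoded documents are assumed to be UTF-8.

```python
import base64


def flatten_decoded_lines(lines):
    """Yield each base64 line decoded to text with CR and LF removed.

    Hypothetical helper: assumes one base64-encoded UTF-8 document per line.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        text = base64.b64decode(line).decode("utf-8")
        yield text.replace("\r", "").replace("\n", "")


if __name__ == "__main__":
    # Sample lines taken from the question.
    sample = [
        "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
        "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
        "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
    ]
    for doc in flatten_decoded_lines(sample):
        if "bar" in doc:  # same effect as piping the output to: grep bar
            print(doc)
```

Each document ends up on a single line, so the output can equally be piped to grep instead of filtering in Python.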
Upvotes: 0
Reputation: 355
You can try the following Python script. It is not a command-line one-liner, but it should give you what you want. Usage:
>python3 get_xml.py SEARCHSTRING FILENAME
Output for your example was:
<foo>
<bar/>
</foo>
<foo>
<barometer/>
</foo>
script:
import base64
import sys

script_name = sys.argv[0]
search_string = sys.argv[1]
filename = sys.argv[2]
# Status line; the format string now has a placeholder for every argument
print("[+] ({}) search for {} in {}".format(script_name, search_string, filename))
with open(filename, "r") as xml_in:
    nextline = xml_in.readline()
    while nextline != '':
        # Decode one document per line and drop trailing whitespace
        xml = base64.b64decode(nextline).decode("utf-8").rstrip()
        if search_string in xml:
            print(xml)
        nextline = xml_in.readline()
Upvotes: 0