Gábor Lipták

Reputation: 9796

Linux shell: Base64 Decode with removing line breaks

I have a file where each line is a base64-encoded XML document. The decoded XML documents may contain new line characters. I would like to grep out each XML document containing a given word.

The problem is that, when I decode the lines of the file, I have multiple lines for each base64-encoded line and I cannot grep it any more. I need something like base64 decode + remove line breaks in one step.

How can I achieve that in the Linux shell? I have Python, Perl and awk available.

>cat fileContainingBase64EncodedXMLsInEachLine.txt | what should I write here?

Input:

PGZvbz4NCjxiYXIvPg0KPC9mb28+
PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==
PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+

Expected Output

Let's say I want the XML documents containing 'bar'

<foo>
<bar/>
</foo>
<foo>
<barometer/>
</foo>

An example for my problem

>cat fileContainingBase64EncodedXMLsInEachLine.txt | base64 --decode | grep bar

Delivers:

<bar/>
<barometer/>

So I only get the matching lines, not the full XML documents containing bar and barometer.

Upvotes: 1

Views: 4238

Answers (5)

Kjetil S.

Reputation: 3787

Perl to the rescue:

perl -MMIME::Base64 -nE '$_=decode_base64($_);/bar/&&say' fileContaining...txt

or

cat fileContaining...txt | perl -MMIME::Base64 -nE'$_=decode_base64($_);/bar/&&say'

Upvotes: 1

kvantour

Reputation: 26551

Update: if you know that the first node name is <foo>, then you can just do:

$ echo "<head>$(base64 --decode file)</head>" | \
  xmlstarlet sel -t -m '//bar/ancestor::foo' -c .

This selects the ancestor named foo of the node called bar. Since foo is the first XML node of each document, it selects the complete requested XML document.

original answer below:

Using xmlstarlet you might want to do this

$ echo "<head>$(base64 --decode file)</head>" | \
  xmlstarlet sel -t -m '//bar/ancestor::*[last()-1]' -c .

This essentially selects the full XML tree of ancestors of the node 'bar', but it only goes up to the correct depth.

I added an extra head node to make the full string a valid xml file. This way you only need to print from the first node onwards.

The echo would produce something like (slightly different version):

<head> 
  <foo /> 
  <foo> 
    <barometer /> 
  </foo> 
  <foo> 
    <DDD> 
      <BBB/> 
      <bar /> 
    </DDD> 
  </foo> 
</head>

xmlstarlet will do a template selection based on the xpath //bar/ancestor::*. Walking outwards from the matched node, this gives the following chain:

  • <bar /> (the matched node itself, not part of the ancestor set)
  • <DDD><BBB /><bar /></DDD>
  • <foo><DDD><BBB /><bar /></DDD></foo>
  • <head> everything </head>

We are interested in the penultimate ancestor, i.e. [last()-1], and we ask to print a copy of it with -c .
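The same wrap-and-select idea can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration under some assumptions: the function name matching_documents and the SAMPLE list are made up for the example, every decoded document must be well-formed, and none may start with an XML declaration (otherwise wrapping them in <head> fails). Like the //bar xpath, it matches on the element name, so <barometer/> is not selected.

```python
import base64
import xml.etree.ElementTree as ET

# Sample input taken from the question: one base64-encoded document per line.
SAMPLE = [
    "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
    "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
    "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
]


def matching_documents(encoded_lines, tag):
    """Return every top-level document that has a descendant element named `tag`."""
    # Decode each line and concatenate the documents into one string.
    decoded = "".join(
        base64.b64decode(line.strip()).decode("utf-8")
        for line in encoded_lines
        if line.strip()
    )
    # Wrap everything in an artificial root so the result is one well-formed document.
    root = ET.fromstring("<head>" + decoded + "</head>")
    # Each child of <head> is one original document.
    return [
        ET.tostring(doc, encoding="unicode").strip()
        for doc in root
        if doc.find(".//" + tag) is not None
    ]


for doc in matching_documents(SAMPLE, "bar"):
    print(doc)
```

Iterating over the children of the artificial root plays the role of ancestor::*[last()-1]: it always yields the outermost node of each original document, whatever its name is.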

Upvotes: 1

PM 2Ring

Reputation: 55499

Here's some Python code that accepts a filename followed by the search word on the command line. As usual, if either argument contains spaces, it must be quoted.

import sys
from base64 import b64decode

fname, pattern = sys.argv[1:]
with open(fname) as f:
    for row in f:
        row = b64decode(row).decode()
        if pattern in row:
            print(row, end='\n\n')

Running this on your data with "bar" as the pattern arg gives:

<foo>
<bar/>
</foo>

<foo>
<barometer/>
</foo>

In order to practice my rather rusty awk skills, I decided to write an awk command line to do this. It uses the standard base64 command to do the decoding. Note that it needs GNU awk (gawk), because the |& coprocess operator is a gawk extension.

awk 'BEGIN{cmd="base64 -d"}; {print |& cmd; close(cmd,"to"); z=""; while((cmd |& getline s) > 0) z=z s "\n"; close(cmd); if (z~pat)print z}' pat='bar' testdata_b64.txt

You pass it the pattern using the pat argument, which can be a regex. You can send data to it via standard input, or you can give it one or more filenames on the command line.

Note that regex patterns need double escaping, e.g. pat='\\<bar\\>' matches the word bar.
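For comparison, here is a rough Python sketch of the same "decode one line, then test the whole document against a regex" approach. The function name grep_documents is just an illustrative choice. Python's re module has no \< and \> anchors, so the word-boundary pattern is written with \b instead, and a raw string avoids the double escaping.

```python
import base64
import re


def grep_documents(encoded_lines, pattern):
    """Yield each decoded document that matches the regular expression `pattern`."""
    regex = re.compile(pattern)
    for line in encoded_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # One input line is one complete document, so the match is tested per document.
        doc = base64.b64decode(line).decode("utf-8")
        if regex.search(doc):
            yield doc


sample = [
    "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
    "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
    "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
]

# r"\bbar\b" matches the word bar only; plain "bar" would also match barometer.
for doc in grep_documents(sample, r"\bbar\b"):
    print(doc)
```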

Upvotes: 3

martin_joerg

Reputation: 1163

You can use tr inside a loop to remove all new lines for each of the XML documents like this:

#!/bin/bash

while IFS='' read -r line
do
    printf '%s' "$line" | base64 --decode | tr -d '\r\n'
    echo
done < fileContainingBase64EncodedXMLsInEachLine.txt
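Since Python is available too, the same idea (decode each line, drop the line breaks, emit one document per output line) could be sketched like this. The name flatten_documents is made up for the example, and the input is assumed to be UTF-8 once decoded.

```python
import base64


def flatten_documents(encoded_lines):
    """Yield each decoded document with all CR and LF characters removed."""
    for line in encoded_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        decoded = base64.b64decode(line).decode("utf-8")
        # Same effect as tr -d '\r\n': the document ends up on a single line.
        yield decoded.replace("\r", "").replace("\n", "")


sample = [
    "PGZvbz4NCjxiYXIvPg0KPC9mb28+",
    "PGZvbz4NCjxodWh1Lz4NCjwvZm9vPg==",
    "PGZvbz4NCjxiYXJvbWV0ZXIvPg0KPC9mb28+",
]

# Because every document is now a single line, a plain substring test
# (or grep on the printed output) selects whole documents.
for doc in flatten_documents(sample):
    if "bar" in doc:
        print(doc)
```

Note that the matching documents come out on one line each, not in the multi-line form shown in the question.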

Upvotes: 0

Zapho Oxx

Reputation: 355

You can try the following Python script. It is not a command-line one-liner, but it should give you what you want. Usage:

>python3 get_xml.py SEARCHSTRING FILENAME

Output for your example was:

<foo>
<bar/>
</foo>
<foo>
<barometer/>
</foo>

Script:

import base64
import sys

script_name = sys.argv[0]
search_string = sys.argv[1]
filename = sys.argv[2]
# Log to stderr so that stdout carries only the matching documents.
print("[+] ({}) search for {} in {}".format(script_name, search_string, filename), file=sys.stderr)
with open(filename, "r") as xml_in:
    # Each input line holds one base64-encoded XML document.
    for line in xml_in:
        xml = base64.b64decode(line).decode("utf-8").rstrip()
        if search_string in xml:
            print(xml)

Upvotes: 0
