Stormdamage

Reputation: 389

Bash to create a JSON File Manifest

I have a bash script that outputs a file manifest with MD5 hashes as JSON, like so:

{
  "files": [
    {
      "md5": "f30ae4b2e0d2551b5962995426be0c3a",
      "path": "assets/asset_1.png"
    },
    {
      "md5": "ca8492fdc3547af31afeeb8656619ef0",
      "path": "assets/asset_2.png"
    }
  ]
}

It should return a list of all files except .gdz files.

The command I am using is:

echo "{\"files\": [$(find . -type f -print | grep -v \.gdz$ | xargs md5sum | sed 's/\.\///' | xargs printf "{\"md5\": \"%s\", \"name\": \"%s\"}," | sed 's/,$//')]}" > files.json

However, when I run this in production, it sometimes switches the MD5 hash and the file path around. I cannot work out why this happens; does anyone know?

Upvotes: 0

Views: 1108

Answers (4)

Benjamin W.

Reputation: 52461

You could run md5sum on all matching files, then do the rest with jq:

find . -type f -not -name '*.gdz' -exec md5sum -z {} + \
    | jq --slurp --raw-input '
        {
            files: split("\u0000")
                | map(select(length > 0))
                | map(split(" "))
                | map([
                    .[0],
                    (.[2:] | join(" "))
                ])
                | map({md5: .[0], path: .[1]})
        }'

The find command hashes all matching files with md5sum (invoked in batches via -exec ... +), with the output records separated by null bytes thanks to -z.

The jq filter then does the following (and can almost certainly be optimized):

  • --slurp and --raw-input read the whole input into a single string before any processing
  • At the outermost level, we build an object with files as the key
  • split("\u0000") creates an array from the null-byte-separated input records; select(length > 0) drops the empty element left over after the final null byte
  • map(split(" ")) converts each record into an array split on blanks
  • map([ .[0], (.[2:] | join(" ")) ]) – to allow blanks in filenames, we create an array for each record where the first element is the md5 hash and the second is the concatenation of the rest, i.e., the filename; .[2:] because md5sum separates hash and name with two blanks, so element 1 is an empty string
  • map({md5: .[0], path: .[1]}) converts each two-element array into an object with the desired keys
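
To see the filter in action without touching the filesystem, you can feed it a couple of hand-built records in the same shape md5sum -z produces (the hashes here are made up — both are just the empty-file md5 — and the second name deliberately contains a space):

```shell
# Two fake md5sum -z records (hash, two blanks, name, NUL byte); the
# select() drops the empty element left behind by the trailing NUL.
printf 'd41d8cd98f00b204e9800998ecf8427e  a.png\0d41d8cd98f00b204e9800998ecf8427e  b c.png\0' |
jq --slurp --raw-input '
    {
        files: split("\u0000")
            | map(select(length > 0))
            | map(split(" "))
            | map([.[0], (.[2:] | join(" "))])
            | map({md5: .[0], path: .[1]})
    }'
```

The second entry comes out with "path": "b c.png", showing that the join(" ") step preserves blanks in filenames.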

Upvotes: 1

Reino

Reputation: 3443

Trying to create JSON with tools that aren't designed for it is a task that's way too error-prone. Please use a dedicated tool to properly create the JSON you want. I'd highly recommend xidel:

xidel -se '
  {
    "files":array{
      for $x in file:list(.,true())[not(ends-with(.,"/")) and not(ends-with(.,"gdz"))]
      return {
        "md5":substring(system(x"md5sum {$x}"),1,32),
        "path":$x
      }
    }
  }
'
  • file:list(.,true()) returns all files and directories in the current directory (and with the optional parameter $recursive set to true() all descendent directories are included as well).
  • [not(ends-with(.,"/")) and not(ends-with(.,"gdz"))] filters the output of file:list() by removing the directories and "gdz"-files.
  • system(x"md5sum {$x}") returns md5sum's stdout result as a string (and substring(..,1,32) obviously returns the first 32 characters).
    x"..{..}.." is an extended string, where x"There are {1+2+3} elements" for instance evaluates to "There are 6 elements".
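
If you just want to sanity-check which files the file:list() predicate would keep, the equivalent selection in plain find is the following sketch (scratch tree is made up; note the predicate matches any name ending in "gdz", not only the ".gdz" extension):

```shell
# Build a scratch tree and list the files the predicate would keep:
# regular files whose names do not end in "gdz".
tmpdir=$(mktemp -d)
mkdir "$tmpdir/assets"
touch "$tmpdir/assets/asset_1.png" "$tmpdir/archive.gdz"

( cd "$tmpdir" && find . -type f ! -name '*gdz' )
# → ./assets/asset_1.png

rm -r "$tmpdir"
```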

Upvotes: 2

M. Nejat Aydin

Reputation: 10133

You may consider this one, which uses bash and the GNU tools find and md5sum. The script uses NUL-terminated pathnames and escapes the relevant characters. It should work even if filenames contain newline characters.
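
As a quick illustration of the record format the loop parses — md5sum -z emits 32 hex digits, two separator characters, then the path, terminated by a NUL byte — here is the same slicing applied to a single record (scratch filename is made up; requires GNU coreutils with -z support):

```shell
# Hash one scratch file with md5sum -z and slice the record the same
# way the loop body does: 32-char hash, skip 2 separators, rest = path.
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/a file.png"    # space in the name on purpose

IFS= read -d '' -r line < <(md5sum -z "$tmpdir/a file.png")
md5=${line:0:32}
path=${line:34}
printf 'md5=%s\npath=%s\n' "$md5" "$path"

rm -r "$tmpdir"
```

The space in the name survives intact because the record is only split at fixed offsets, never on whitespace.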

#!/bin/bash

comma=
printf '{\n  "files": [\n'
while IFS= read -d '' -r line; do
    md5=${line:0:32}
    path=${line:34}
    path=${path//'\'/'\\'}
    path=${path//'"'/'\"'}
    path=${path//$'\b'/'\b'}
    path=${path//$'\f'/'\f'}
    path=${path//$'\n'/'\n'}
    path=${path//$'\r'/'\r'}
    path=${path//$'\t'/'\t'}
    printf '%s%4s{\n%6s"md5": "%s",\n%6s"path": "%s"\n%4s}' \
        "$comma" '' '' "${md5}" '' "${path}" ''
    comma=$',\n'
done < <(find . -type f ! -name '*.gdz' -exec md5sum -z {} +)
printf '\n  ]\n}\n'
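
To see the escaping block in isolation, here it is applied to a deliberately awkward, made-up path containing a quote, a backslash, and a tab:

```shell
# Apply the loop's JSON string escaping to a pathological name and
# print the escaped form that would land between the JSON quotes.
path=$'assets/we"ird\\name\tfile.png'
path=${path//'\'/'\\'}
path=${path//'"'/'\"'}
path=${path//$'\b'/'\b'}
path=${path//$'\f'/'\f'}
path=${path//$'\n'/'\n'}
path=${path//$'\r'/'\r'}
path=${path//$'\t'/'\t'}
printf '%s\n' "$path"
# → assets/we\"ird\\name\tfile.png
```

Note that the backslash substitution must come first, so the backslashes introduced by the later substitutions are not escaped a second time.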

or, alternatively, using GNU sed (this version might be faster):

#!/bin/bash

printf '{\n  "files": ['
find . -type f ! -name '*.gdz' -exec md5sum -z {} + |
sed -Ez '
s/\\/\\\\/g
s/"/\\"/g
s/\x08/\\b/g
s/\f/\\f/g
s/\n/\\n/g
s/\r/\\r/g
s/\t/\\t/g
s/(.{32})..(.*)/\
    {\
      "md5": "\1",\
      "path": "\2"\
    }/
$!s/$/,/' | tr -d '\0'

printf '\n  ]\n}\n'

I believe both scripts are robust, assuming recent GNU tools.

Upvotes: 1

Shawn

Reputation: 52579

Doing this robustly in shell is a bit of a pain; you have to worry about things like spaces in filenames (which will break your current code), properly encoding and escaping your JSON strings (what if you have a file with quotes in its name?), etc.
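
A minimal way to reproduce the swap from the question (with made-up hashes and filenames, using printf directly in place of the full pipeline): xargs splits its input on whitespace, so a space inside a filename shifts every following argument by one, and a path lands in the md5 slot.

```shell
# Feed xargs two fake md5sum output lines, the first with a space in
# the filename. xargs word-splits the input, so printf's %s pairs drift.
printf '%s\n' \
    'f30ae4b2e0d2551b5962995426be0c3a  my file.png' \
    'ca8492fdc3547af31afeeb8656619ef0  other.png' |
xargs printf '{"md5": "%s", "path": "%s"},\n'
```

The second record comes out as {"md5": "file.png", "path": "ca8492fdc3547af31afeeb8656619ef0"}, — hash and path swapped, which is exactly the symptom described in the question.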

A quick perl script that does the same, with the directory to scan passed as a command-line argument:

#!/usr/bin/env perl
use warnings;
use strict;
use File::Find;
use Digest::MD5;
use JSON::PP; # Or JSON::XS if installed

my @hashes;

find(\&wanted, @ARGV);
print JSON::PP->new->ascii->encode({files => \@hashes});

sub wanted {
    if (-f $_ && $_ !~ /\.gdz$/) {
        my $name = $File::Find::name;
        $name =~ s!^\./!!;
        open my $f, "<:raw", $_ or
            die "Couldn't open $name: $!\n";
        push @hashes, { path => $name,
                        md5 => Digest::MD5->new()->addfile($f)->hexdigest
        };
        close $f;
    }
}

Upvotes: 2
