Reputation: 389
I have a bash script that outputs a file manifest with MD5 hashes as JSON, like so:
{
"files": [
{
"md5": "f30ae4b2e0d2551b5962995426be0c3a",
"path": "assets/asset_1.png"
},
{
"md5": "ca8492fdc3547af31afeeb8656619ef0",
"path": "assets/asset_2.png"
}
]
}
It lists all files except those ending in .gdz.
The command I am using is:
echo "{\"files\": [$(find . -type f -print | grep -v \.gdz$ | xargs md5sum | sed 's/\.\///' | xargs printf "{\"md5\": \"%s\", \"name\": \"%s\"}," | sed 's/,$//')]}" > files.json
However, when I run this in production, it sometimes swaps the MD5 hash and the file path. I cannot work out why. Does anyone know?
Upvotes: 0
Views: 1108
Reputation: 52461
You could run md5sum on all matching files, then do the rest with jq:
find . -type f -not -name '*.gdz' -exec md5sum -z {} + \
| jq --slurp --raw-input '
{
files: split("\u0000")
| map(select(length > 0)) # drop the empty record after the final null byte
| map(split(" "))
| map([
.[0],
(.[2:] | join(" "))
])
| map({md5: .[0], path: .[1]})
}'
The find command runs md5sum on all matching files in as few invocations as possible (a single one unless the argument list gets too long), with output records separated by null bytes.
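Each of those records consists of the 32-character hash, two blanks, and the path. As a minimal POSIX shell sketch of pulling one record apart (the sample record is made up for illustration, and this is not part of the jq answer itself):

```shell
# Hypothetical sample record; the path contains a single blank.
record='f30ae4b2e0d2551b5962995426be0c3a  ./assets/my asset.png'

md5=${record%%  *}    # everything before the first pair of blanks
path=${record#*  }    # everything after the first pair of blanks

printf 'md5=%s\npath=%s\n' "$md5" "$path"
```

Because the split happens only at the first pair of blanks, any blanks inside the path are kept as they are.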
The jq filter then does the following (and can almost certainly be optimized):
--slurp and --raw-input read the whole input before any processing
files is used as the key of the resulting object
split("\u0000") creates an array from the null-byte-separated input records
map(split(" ")) converts each array element to an array split on blanks
map([ .[0], (.[2:] | join(" ")) ]) allows blanks in filenames: it creates an array for each record where the first element is the md5 hash and the second element is the concatenation of the rest, i.e., the filename; [2:] because we want to skip the two blanks after the hash
map({md5: .[0], path: .[1]}) converts each two-element array into an object with the desired keys
Upvotes: 1
Reputation: 3443
Trying to create JSON with tools that aren't designed for it is far too error-prone. Please use a dedicated tool to properly create the JSON you want. I'd highly recommend xidel:
xidel -se '
{
"files":array{
for $x in file:list(.,true())[not(ends-with(.,"/")) and not(ends-with(.,"gdz"))]
return {
"md5":substring(system(x"md5sum {$x}"),1,32),
"path":$x
}
}
}
'
file:list(.,true()) returns all files and directories in the current directory (with the optional parameter $recursive set to true(), all descendant directories are included as well).
[not(ends-with(.,"/")) and not(ends-with(.,"gdz"))] filters the output of file:list() by removing the directories and "gdz" files.
system(x"md5sum {$x}") returns the stdout of md5sum as a string (and substring(..,1,32) returns the first 32 characters, i.e. the hash).
x"..{..}.." is an extended string; for instance, x"There are {1+2+3} elements" evaluates to "There are 6 elements".
Upvotes: 2
Reputation: 10133
You may consider trying this one, using bash and the GNU tools find and md5sum. The script uses NUL-terminated pathnames and escapes the relevant characters. It should work even if filenames contain newline characters.
#!/bin/bash
comma=
printf '{\n "files": [\n'
# Read the NUL-terminated "hash  path" records produced by md5sum -z
while IFS= read -d '' -r line; do
    md5=${line:0:32}   # the hash is the first 32 characters
    path=${line:34}    # skip the hash and the two separator characters
    # Escape backslashes first, then quotes and control characters
    path=${path//'\'/'\\'}
    path=${path//'"'/'\"'}
    path=${path//$'\b'/'\b'}
    path=${path//$'\f'/'\f'}
    path=${path//$'\n'/'\n'}
    path=${path//$'\r'/'\r'}
    path=${path//$'\t'/'\t'}
    printf '%s%4s{\n%6s"md5": "%s",\n%6s"path": "%s"\n%4s}' \
        "$comma" '' '' "${md5}" '' "${path}" ''
    comma=$',\n'
done < <(find . -type f ! -name '*.gdz' -exec md5sum -z {} +)
printf '\n ]\n}\n'
Or, also using GNU sed (this version might be faster):
#!/bin/bash
printf '{\n "files": ['
find . -type f ! -name '*.gdz' -exec md5sum -z {} + |
sed -Ez '
s/\\/\\\\/g
s/"/\\"/g
s/\x08/\\b/g
s/\f/\\f/g
s/\n/\\n/g
s/\r/\\r/g
s/\t/\\t/g
s/(.{32})..(.*)/\
{\
"md5": "\1",\
"path": "\2"\
}/
$!s/$/,/' | tr -d '\0'
printf '\n ]\n}\n'
I believe both scripts are robust, assuming recent GNU tools.
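Both scripts escape backslashes before anything else, and that order matters. A minimal sketch of why, using a hypothetical helper json_escape that is not part of the scripts above and handles only backslashes and double quotes:

```shell
# Hypothetical helper: escape a string for use inside a JSON string literal.
# Only backslash and double quote are handled here; control characters
# would need further substitutions like the ones in the scripts above.
json_escape() {
  # Backslashes first, so the backslashes added for quotes are not doubled.
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

json_escape 'my "odd\name".png'
printf '\n'
# my \"odd\\name\".png
```

If the two substitutions were swapped, the backslash inserted in front of each quote would itself be doubled, leaving the quote unescaped.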
Upvotes: 1
Reputation: 52579
Doing this robustly in shell is a bit of a pain; you have to worry about things like spaces in filenames (which will break your current code), properly encoding and escaping your JSON strings (what if you have a file with quotes as part of the name?), etc.
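A space in a filename is the likely cause of the swapped fields: xargs splits its input on blanks as well as newlines, so a path with a space becomes two arguments and every later hash/path pair shifts by one. A small sketch with made-up hashes and filenames (demo_swap is a hypothetical name):

```shell
# Simulated md5sum output; the second path contains a space.
# The "hashes" are placeholders, not real MD5 values.
demo_swap() {
  printf '%s\n' \
    'aaaa  one.png' \
    'bbbb  two words.png' \
    'cccc  three.png' |
    xargs printf '{"md5": "%s", "name": "%s"}\n'
}

demo_swap
# {"md5": "aaaa", "name": "one.png"}
# {"md5": "bbbb", "name": "two"}
# {"md5": "words.png", "name": "cccc"}
# {"md5": "three.png", "name": ""}
```

From the third object on, the hash lands in "name" and part of a path lands in "md5", which matches the symptom in the question.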
A quick perl script that does the same, with the directory to scan passed as a command-line argument:
#!/usr/bin/env perl
use warnings;
use strict;
use File::Find;
use Digest::MD5;
use JSON::PP; # Or JSON::XS if installed
my @hashes;
find(\&wanted, @ARGV);
print JSON::PP->new->ascii->encode({files => \@hashes});
# Called by File::Find for every entry; records regular files not ending in .gdz
sub wanted {
    if (-f $_ && $_ !~ /\.gdz$/) {
        my $name = $File::Find::name;
        $name =~ s!^\./!!; # Strip the leading ./
        open my $f, "<:raw", $_ or
            die "Couldn't open $name: $!\n";
        push @hashes, { path => $name,
                        md5 => Digest::MD5->new()->addfile($f)->hexdigest
                      };
        close $f;
    }
}
Upvotes: 2