Reputation: 7068
Surely there must be a way to do this easily!
I've tried the Linux command-line apps such as sha1sum and md5sum, but they seem only to be able to compute hashes of individual files and output a list of hash values, one for each file.
I need to generate a single hash for the entire contents of a folder (not just the filenames).
I'd like to do something like
sha1sum /folder/of/stuff > singlehashvalue
Edit: to clarify, my files are at multiple levels in a directory tree, they're not all sitting in the same root folder.
Upvotes: 165
Views: 128598
Reputation: 52659
# 1. How to get a sha256 hash over all file contents in a folder, including
# hashing over the relative file paths within that folder to check the
# filenames themselves (get this bash function below).
sha256sum_dir "path/to/folder"
# 2. How to quickly compare two folders (get the `diff_dir` bash function below)
diff_dir "path/to/folder1" "path/to/folder2"
# OR:
diff -r -q "path/to/folder1" "path/to/folder2"
Do this instead of the main answer, to get a single hash for all non-directory file contents within an entire folder, no matter where the folder is located:
This is a "1-line" command. Copy and paste the whole thing to run it all at once:
# This one works, but don't use it, because its hash output does NOT
# match that of my `sha256sum_dir` function. I recommend you use
# the "1-liner" just below, therefore, instead.
time ( \
starting_dir="$(pwd)" \
&& target_dir="path/to/folder" \
&& cd "$target_dir" \
&& find . -not -type d -print0 | sort -zV \
| xargs -0 sha256sum | sha256sum; \
cd "$starting_dir"
)
However, that produces a slightly different hash than my sha256sum_dir bash function, which I present below. So, to get an output hash that exactly matches the output of my sha256sum_dir function, do this instead:
# Use this one, as its output matches that of my `sha256sum_dir`
# function exactly.
# The `cd` happens inside the `$(...)` subshell, so the calling shell's
# working directory never changes and there is nothing to cd back to.
all_hashes_str="$( \
target_dir="path/to/folder" \
&& cd "$target_dir" \
&& find . -not -type d -print0 | sort -zV | xargs -0 sha256sum \
)"; \
printf "%s" "$all_hashes_str" | sha256sum
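A likely reason the two one-liners disagree (my reading; the answer does not spell it out): "$(...)" strips trailing newlines, so the second variant hashes the listing without its final newline, while the direct pipe in the first variant keeps it. A minimal Python sketch of that effect, using a made-up listing rather than real sha256sum output:

```python
import hashlib

# Hypothetical listing, shaped like `sha256sum` output for two files.
listing = "aaaa  ./file1.txt\nbbbb  ./file2.txt\n"

# Piping straight into sha256sum hashes every byte, final newline included.
piped = hashlib.sha256(listing.encode()).hexdigest()

# "$(...)" drops trailing newlines before `printf "%s"` re-emits the text.
captured = hashlib.sha256(listing.rstrip("\n").encode()).hexdigest()

print(piped == captured)  # False
```

One missing byte is enough to change the digest completely, so both folders must be hashed with the same variant.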
For more on why the main answer doesn't produce identical hashes for identical folders in different locations, see further below.
sha256sum_dir and diff_dir
Place the following functions in your ~/.bashrc file or in your ~/.bash_aliases file, assuming your ~/.bashrc file sources the ~/.bash_aliases file like this:
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi
You can find both of the functions below in my personal ~/.bash_aliases file in my eRCaGuy_dotfiles repo.
Here is the sha256sum_dir function, which obtains a total "directory" hash of all files in the directory:
# Take the sha256sum of all files in an entire dir, and then sha256sum that
# entire output to obtain a _single_ sha256sum which represents the _entire_
# dir.
# See:
# 1. [my answer] https://stackoverflow.com/a/72070772/4561887
sha256sum_dir() {
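# Fall back to 0 / 1 if RETURN_CODE_SUCCESS / RETURN_CODE_ERROR are not set
# in this shell; an empty value would make `return` fail.
return_code="${RETURN_CODE_SUCCESS:-0}"
if [ "$#" -eq 0 ]; then
echo "ERROR: too few arguments."
return_code="${RETURN_CODE_ERROR:-1}"
fi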
# Print help string if requested
if [ "$#" -eq 0 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
# Help string
echo "Obtain a sha256sum of all files in a directory."
echo "Usage: ${FUNCNAME[0]} [-h|--help] <dir>"
return "$return_code"
fi
starting_dir="$(pwd)"
target_dir="$1"
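# Stop here if the dir is missing; otherwise we would hash the current dir.
cd "$target_dir" || return "${RETURN_CODE_ERROR:-1}"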
# See my answer: https://stackoverflow.com/a/72070772/4561887
filenames="$(find . -not -type d | sort -V)"
IFS=$'\n' read -r -d '' -a filenames_array <<< "$filenames"
time all_hashes_str="$(sha256sum "${filenames_array[@]}")"
cd "$starting_dir"
echo ""
echo "Note: you may now call:"
echo "1. 'printf \"%s\n\" \"\$all_hashes_str\"' to view the individual" \
"hashes of each file in the dir. Or:"
echo "2. 'printf \"%s\" \"\$all_hashes_str\" | sha256sum' to see that" \
"the hash of that output is what we are using as the final hash" \
"for the entire dir."
echo ""
printf "%s" "$all_hashes_str" | sha256sum | awk '{ print $1 }'
return "$?"
}
# Note: I prefix this with my initials to find my custom functions easier
alias gs_sha256sum_dir="sha256sum_dir"
Assuming you just want to compare two directories for equality, you can use diff -r -q "dir1" "dir2" instead, which I wrapped in this diff_dir command. I learned about using the diff command to compare entire folders here: how do I check that two folders are the same in linux.
# Compare dir1 against dir2 to see if they are equal or if they differ.
# See:
# 1. How to `diff` two dirs: https://stackoverflow.com/a/16404554/4561887
diff_dir() {
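# Fall back to 0 / 1 if RETURN_CODE_SUCCESS / RETURN_CODE_ERROR are not set
# in this shell; an empty value would make `return` fail.
return_code="${RETURN_CODE_SUCCESS:-0}"
if [ "$#" -eq 0 ]; then
echo "ERROR: too few arguments."
return_code="${RETURN_CODE_ERROR:-1}"
fi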
# Print help string if requested
if [ "$#" -eq 0 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
echo "Compare (diff) two directories to see if dir1 contains the same" \
"content as dir2."
echo "NB: the output will be **empty** if both directories match!"
echo "Usage: ${FUNCNAME[0]} [-h|--help] <dir1> <dir2>"
return "$return_code"
fi
dir1="$1"
dir2="$2"
time diff -r -q "$dir1" "$dir2"
return_code="$?"
if [ "$return_code" -eq 0 ]; then
echo -e "\nDirectories match!"
fi
# echo "$return_code"
return "$return_code"
}
# Note: I prefix this with my initials to find my custom functions easier
alias gs_diff_dir="diff_dir"
Here is the output of my sha256sum_dir command on my ~/temp2 dir (which I describe just below so you can reproduce it and test this yourself). You can see the total folder hash is b86c66bcf2b033f65451e8c225425f315e618be961351992b7c7681c3822f6a3 in this case:
$ gs_sha256sum_dir ~/temp2
real 0m0.007s
user 0m0.000s
sys 0m0.007s
Note: you may now call:
1. 'printf "%s\n" "$all_hashes_str"' to view the individual hashes of each
file in the dir. Or:
2. 'printf "%s" "$all_hashes_str" | sha256sum' to see that the hash of that
output is what we are using as the final hash for the entire dir.
b86c66bcf2b033f65451e8c225425f315e618be961351992b7c7681c3822f6a3
Here is the cmd and output of diff_dir to compare two dirs for equality. This is checking that copying an entire directory to my SD card just now worked correctly. I made the output indicate "Directories match!" whenever that is the case:
$ gs_diff_dir "path/to/sd/card/tempdir" "/home/gabriel/tempdir"
real 0m0.113s
user 0m0.037s
sys 0m0.077s
Directories match!
I tried the most-upvoted answer here, and it doesn't work quite right as-is. It needs a little tweaking. It doesn't work quite right because the hash changes based on the folder-of-interest's base path! That means that an identical copy of some folder will have a different hash than the folder it was copied from even if the two folders are perfect matches and contain exactly the same content! That kind of defeats the purpose of taking a hash of the folder if the hashes of two identical folders differ! Let me explain:
Assume I have a folder named temp2 at ~/temp2. It contains file1.txt, file2.txt, and file3.txt. file1.txt contains the letter a followed by a return, file2.txt contains the letter b followed by a return, and file3.txt contains the letter c followed by a return.
If I run find /home/gabriel/temp2, I get:
$ find /home/gabriel/temp2
/home/gabriel/temp2
/home/gabriel/temp2/file3.txt
/home/gabriel/temp2/file1.txt
/home/gabriel/temp2/file2.txt
If I forward that to sha256sum (in place of sha1sum) in the same pattern as the main answer states, I get this. Notice it has the full path after each hash, which is not what we want:
$ find /home/gabriel/temp2 -type f -print0 | sort -z | xargs -0 sha256sum
87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7 /home/gabriel/temp2/file1.txt
0263829989b6fd954f72baaf2fc64bc2e2f01d692d4de72986ea808f6e99813f /home/gabriel/temp2/file2.txt
a3a5e715f0cc574a73c3f9bebb6bc24f32ffd5b67b387244c2c909da779a1478 /home/gabriel/temp2/file3.txt
If you then pipe that output string above to sha256sum again, it hashes the file hashes with their full file paths, which is not what we want! The file hashes may match in a folder and in a copy of that folder exactly, but the absolute paths do NOT match exactly, so they will produce different final hashes since we are hashing over the full file paths as part of our single, final hash!
Instead, what we want is the relative file path next to each hash. To do that, you must first cd into the folder of interest, and then run the hash command over all files therein, like this:
cd "/home/gabriel/temp2" && find . -type f -print0 | sort -z | xargs -0 sha256sum
Now, I get this. Notice the file paths are all relative now, which is what I want!:
$ cd "/home/gabriel/temp2" && find . -type f -print0 | sort -z | xargs -0 sha256sum
87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7 ./file1.txt
0263829989b6fd954f72baaf2fc64bc2e2f01d692d4de72986ea808f6e99813f ./file2.txt
a3a5e715f0cc574a73c3f9bebb6bc24f32ffd5b67b387244c2c909da779a1478 ./file3.txt
Good. Now, if I hash that entire output string, since the file paths are all relative in it, the final hash will match exactly for a folder and its copy! In this way, we are hashing over the file contents and the file names within the directory of interest, to get a different hash for a given folder if either the file contents are different or the filenames are different, or both.
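For readers who prefer Python, the same relative-path idea can be sketched as follows. This is an illustration under assumptions, not a port of sha256sum_dir: the names file_sha256 and dir_sha256 are hypothetical, sorting uses Python's default string order rather than sort -V, and the result is not claimed to match the shell output byte for byte.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Hash one file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dir_sha256(root):
    """Hash file contents plus paths relative to `root` (hypothetical helper)."""
    rel_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel_paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    # One "<hash>  ./<relative path>" line per file, in sorted path order.
    lines = [
        f"{file_sha256(os.path.join(root, rel))}  ./{rel}"
        for rel in sorted(rel_paths)
    ]
    listing = "\n".join(lines)
    return hashlib.sha256(listing.encode("utf-8", "surrogateescape")).hexdigest()
```

Because only paths relative to root enter the listing, a folder and its copy at a different location produce the same value, while a changed file or a renamed file changes it.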
Upvotes: 3
Reputation: 2016
I've written a Groovy script to do this:
import java.security.MessageDigest
public static String generateDigest(File file, String digest, int paddedLength){
MessageDigest md = MessageDigest.getInstance(digest)
md.reset()
def files = []
def directories = []
if(file.isDirectory()){
file.eachFileRecurse(){sf ->
if(sf.isFile()){
files.add(sf)
}
else{
directories.add(file.toURI().relativize(sf.toURI()).toString())
}
}
}
else if(file.isFile()){
files.add(file)
}
files.sort({a, b -> return a.getAbsolutePath() <=> b.getAbsolutePath()})
directories.sort()
files.each(){f ->
println file.toURI().relativize(f.toURI()).toString()
f.withInputStream(){is ->
byte[] buffer = new byte[8192]
int read = 0
while((read = is.read(buffer)) > 0){
md.update(buffer, 0, read)
}
}
}
directories.each(){d ->
println d
md.update(d.getBytes())
}
byte[] digestBytes = md.digest()
BigInteger bigInt = new BigInteger(1, digestBytes)
return bigInt.toString(16).padLeft(paddedLength, '0')
}
println "\n${generateDigest(new File(args[0]), 'SHA-256', 64)}"
You can customize the usage to avoid printing each file, change the message digest, take out directory hashing, etc. I've tested it against the NIST test data and it works as expected. http://www.nsrl.nist.gov/testdata/
gary-macbook:Scripts garypaduana$ groovy dirHash.groovy /Users/garypaduana/.config
.DS_Store
configstore/bower-github.yml
configstore/insight-bower.json
configstore/update-notifier-bower.json
filezilla/filezilla.xml
filezilla/layout.xml
filezilla/lockfile
filezilla/queue.sqlite3
filezilla/recentservers.xml
filezilla/sitemanager.xml
gtk-2.0/gtkfilechooser.ini
a/
configstore/
filezilla/
gtk-2.0/
lftp/
menus/
menus/applications-merged/
79de5e583734ca40ff651a3d9a54d106b52e94f1f8c2cd7133ca3bbddc0c6758
Upvotes: 1
Reputation: 318
There is a Python script for that:
https://code.activestate.com/recipes/576973-getting-the-sha-1-or-md5-hash-of-a-directory/
If you rename files without changing their alphabetical order, the script will not detect it. But if you change the order of the files or the contents of any file, running the script will give you a different hash than before.
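To make that limitation concrete, here is a small sketch (not the linked recipe itself) that compares a contents-only hash with one that also mixes in each file name. A plain dict stands in for the directory, and both function names are made up for illustration.

```python
import hashlib


def contents_only_hash(files):
    """`files` maps name -> bytes; only the bytes, in name order, are hashed."""
    digest = hashlib.sha1()
    for name in sorted(files):
        digest.update(files[name])
    return digest.hexdigest()


def names_and_contents_hash(files):
    """Same walk, but each name is fed into the digest before its bytes."""
    digest = hashlib.sha1()
    for name in sorted(files):
        digest.update(name.encode("utf-8") + b"\0")
        digest.update(files[name])
    return digest.hexdigest()


before = {"a.txt": b"first\n", "b.txt": b"second\n"}
renamed = {"a.txt": b"first\n", "c.txt": b"second\n"}  # b.txt -> c.txt, order kept

print(contents_only_hash(before) == contents_only_hash(renamed))            # True
print(names_and_contents_hash(before) == names_and_contents_hash(renamed))  # False
```

If renames matter for your use case, feeding the relative file name into the digest closes the gap.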
Upvotes: 2
Reputation: 627
This adds multiprocessing and a progress bar to kvantour's answer.
It is around 30x faster (depending on the CPU):
100%|██████████████████████████████████| 31378/31378 [03:03<00:00, 171.43file/s]
# to hash without permissions
find . -type f -print0 | sort -z | xargs -P $(nproc --all) -0 sha1sum | tqdm --unit file --total $(find . -type f | wc -l) | sort | awk '{ print $1 }' | sha1sum
# to hash permissions
(find . -type f -print0 | sort -z | xargs -P $(nproc --all) -0 sha1sum | sort | awk '{ print $1 }';
find . \( -type f -o -type d \) -print0 | sort -z | xargs -P $(nproc --all) -0 stat -c '%n %a') | \
sort | sha1sum | awk '{ print $1 }'
Make sure tqdm is installed (pip install tqdm), or check its documentation.
awk removes the file path, so a different parent directory or path does not affect the hash.
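A rough Python counterpart of the first pipeline, without the progress bar, might look like this. It is a sketch under assumptions: parallel_dir_hash is a hypothetical name, threads are used instead of processes (hashlib releases the GIL while hashing large buffers), and the output is not meant to equal the shell result.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash one file in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parallel_dir_hash(root, workers=None):
    """Hash files concurrently, then combine the sorted per-file hashes."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        file_hashes = list(pool.map(sha1_of_file, paths))
    # Sorting the bare hashes makes the result independent of traversal
    # order and of the parent path, like `sort | awk '{ print $1 }'`.
    combined = "".join(h + "\n" for h in sorted(file_hashes))
    return hashlib.sha1(combined.encode("ascii")).hexdigest()
```

As with the awk step, dropping the paths means a pure rename goes unnoticed; only content changes, additions, and deletions alter the result.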
Upvotes: 3
Reputation: 2348
Here's a simple, short variant in Python 3 that works fine for small-sized files (e.g. a source tree or something, where every file individually can fit into RAM easily), ignoring empty directories, based on the ideas from the other solutions:
import os, hashlib

def hash_for_directory(path, hashfunc=hashlib.sha1):
    filenames = sorted(os.path.join(dp, fn) for dp, _, fns in os.walk(path) for fn in fns)
    index = '\n'.join('{}={}'.format(os.path.relpath(fn, path), hashfunc(open(fn, 'rb').read()).hexdigest()) for fn in filenames)
    return hashfunc(index.encode('utf-8')).hexdigest()
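It works like this:
1. Find all files in the directory recursively and sort them by path.
2. Compute the hash (default: SHA-1) of every file, reading each whole file into memory.
3. Build a textual index of filename=hash lines, with filenames relative to the directory.
4. Encode that index as UTF-8 and hash it to get the final result.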
You can pass in a different hash function as second parameter if SHA-1 is not your cup of tea.
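If some files are too large to read in one go, the same index-based idea can be kept while reading each file in chunks. The following is a sketch; hash_for_directory_streaming and chunk_size are names I made up, and it is intended to return the same value as the function above for the same tree.

```python
import hashlib
import os


def hash_for_directory_streaming(path, hashfunc=hashlib.sha1, chunk_size=65536):
    """Index-based directory hash that reads each file in fixed-size chunks."""
    filenames = sorted(
        os.path.join(dp, fn) for dp, _, fns in os.walk(path) for fn in fns
    )
    entries = []
    for fn in filenames:
        digest = hashfunc()
        with open(fn, "rb") as handle:
            # Feed the file to the hash piece by piece instead of all at once.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        entries.append("{}={}".format(os.path.relpath(fn, path), digest.hexdigest()))
    return hashfunc("\n".join(entries).encode("utf-8")).hexdigest()
```

Memory use then depends on chunk_size and the number of files rather than on the size of the largest file.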
Upvotes: 2
Reputation: 398
This is what I have off the top of my head; anyone who has spent some time working on this in practice will have caught other gotchas and corner cases.
Here's a tool, very light on memory, which addresses most cases, might be a bit rough around the edges but has been quite helpful.
dtreetrawl.
Usage:
  dtreetrawl [OPTION...] "/trawl/me" [path2,...]

Help Options:
  -h, --help                Show help options

Application Options:
  -t, --terse               Produce a terse output; parsable.
  -j, --json                Output as JSON
  -d, --delim=:             Character or string delimiter/separator for terse output(default ':')
  -l, --max-level=N         Do not traverse tree beyond N level(s)
  --hash                    Enable hashing(default is MD5).
  -c, --checksum=md5        Valid hashing algorithms: md5, sha1, sha256, sha512.
  -R, --only-root-hash      Output only the root hash. Blank line if --hash is not set
  -N, --no-name-hash        Exclude path name while calculating the root checksum
  -F, --no-content-hash     Do not hash the contents of the file
  -s, --hash-symlink        Include symbolic links' referent name while calculating the root checksum
  -e, --hash-dirent         Include hash of directory entries while calculating root checksum
A snippet of human friendly output:
...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
        Base name                    : CREDITS
        Level                        : 1
        Type                         : regular file
        Referent name                :
        File size                    : 98443 bytes
        I-node number                : 290850
        No. directory entries        : 0
        Permission (octal)           : 0644
        Link count                   : 1
        Ownership                    : UID=0, GID=0
        Preferred I/O block size     : 4096 bytes
        Blocks allocated             : 200
        Last status change           : Tue, 21 Nov 17 21:28:18 +0530
        Last file access             : Thu, 28 Dec 17 00:53:27 +0530
        Last file modification       : Tue, 21 Nov 17 21:28:18 +0530
        Hash                         : 9f0312d130016d103aa5fc9d16a2437e

Stats for /home/lab/linux-4.14-rc8:
        Elapsed time     : 1.305767 s
        Start time       : Sun, 07 Jan 18 03:42:39 +0530
        Root hash        : 434e93111ad6f9335bb4954bc8f4eca4
        Hash type        : md5
        Depth            : 8
        Total,
                size           : 66850916 bytes
                entries        : 12484
                directories    : 763
                regular files  : 11715
                symlinks       : 6
                block devices  : 0
                char devices   : 0
                sockets        : 0
                FIFOs/pipes    : 0
Upvotes: 8
Reputation: 8173
So far the fastest way to do it is still with tar, and with a few additional parameters we can also get rid of the differences caused by metadata.
To hash a directory with GNU tar, you need to sort the paths while archiving; otherwise the order of entries in the archive, and with it the hash, may differ.
tar -C <root-dir> -cf - --sort=name <dir> | sha256sum
If you do not care about access or modification times, also pass something like --mtime='UTC 2019-01-01' to make sure all timestamps are the same.
Usually we also need to add --group=0 --owner=0 --numeric-owner to unify the owner metadata.
To ignore some files, use --exclude=PATTERN.
It is highly recommended that you always compare the permissions. If you really do not want to compare them, use:
--mode=777
This will force all file permissions to 777.
example:
$ echo a > test1/a.txt
$ echo b > test1/b.txt
$ tar -C ./ -cf - --sort=name test1 | sha256sum
e159ca984835cf4e1c9c7e939b7069d39b2fd2aa90460877f68f624458b1c95c -
$ tar -C ./ -cf - --sort=name --mode=777 test1 | sha256sum
ef84fe411fb49bcf7967715b7854075004f1c7a7e4a57d2f3742afa4a54c40de -
$ chmod 444 test1/a.txt
$ tar -C ./ -cf - --sort=name --mode=777 test1 | sha256sum
ef84fe411fb49bcf7967715b7854075004f1c7a7e4a57d2f3742afa4a54c40de -
$ tar -C ./ -cf - --sort=name test1 | sha256sum
9b91430d954abb8a361b01de30f0995fb94a511c8fe1f7177ddcd475c85c65ff -
Some tar implementations do not have --sort, so be sure you have GNU tar.
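Where GNU tar is not available, a similar effect can be approximated with Python's tarfile module. This is only a sketch under assumptions: tar_style_hash, _HashSink and _normalize are hypothetical names, the archive bytes will not match GNU tar's, and permissions are deliberately kept in the hash, in line with the recommendation above.

```python
import hashlib
import os
import tarfile


class _HashSink:
    """Write-only file object that feeds every byte into a hash."""

    def __init__(self, digest):
        self.digest = digest

    def write(self, data):
        self.digest.update(data)
        return len(data)


def _normalize(info):
    """Drop metadata that differs between otherwise identical copies."""
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def tar_style_hash(root):
    """Hash a name-sorted tar stream of everything under `root`."""
    rel_paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            rel_paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    digest = hashlib.sha256()
    # "w|" writes an uncompressed stream, so nothing is stored on disk.
    with tarfile.open(fileobj=_HashSink(digest), mode="w|",
                      format=tarfile.GNU_FORMAT) as archive:
        for rel in sorted(rel_paths):
            archive.add(os.path.join(root, rel), arcname=rel,
                        recursive=False, filter=_normalize)
    return digest.hexdigest()
```

Sorting the names plays the role of --sort=name, and _normalize stands in for --mtime, --owner=0 and --group=0.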
Upvotes: 26
Reputation: 59346
Use a file system intrusion detection tool like aide.
Hash a tarball of the directory:
tar cvf - /path/to/folder | sha1sum
Code something yourself, like vatine's oneliner:
find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
Upvotes: 53
Reputation: 52659
Assuming you are trying to compare a folder and all its contents to ensure it was copied correctly from one computer to another, for instance, you can do it as follows. Let's assume the folder is named mydir and is at path /home/gabriel/mydir on computer 1, and at /home/gabriel/dev/repos/mydir on computer 2.
# 1. First, cd to the dir in which the dir of interest is found. This is
# important! If you don't do this, then the paths output by find will differ
# between the two computers since the absolute paths to `mydir` differ. We are
# going to hash the paths too, not just the file contents, so this matters.
cd /home/gabriel # on computer 1
cd /home/gabriel/dev/repos # on computer 2
# 2. hash all files inside `mydir`, then hash the list of all hashes and their
# respective file paths. This obtains one single final hash. Sorting is
# necessary by piping to `sort` to ensure we get a consistent file order in
# order to ensure a consistent final hash result.
find mydir -type f -exec sha256sum {} + | sort | sha256sum
# Optionally pipe that output to awk to filter in on just the hash (first field
# in the output)
find mydir -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
That's it!
To see the intermediary list of file hashes, for learning's sake, just run this:
find mydir -type f -exec sha256sum {} + | sort
Note that the above commands ignore empty directories, file permissions, timestamps of when files were last edited, etc. For most cases though that's ok.
Here is a real run and actual output. I wanted to ensure my eclipse-workspace folder was properly copied from one computer to another. As you can see, the time command tells me it took 11.790 seconds:
$ time find eclipse-workspace -type f -exec sha256sum {} + | sort | sha256sum
8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4 -
real 0m11.790s
user 0m11.372s
sys 0m0.432s
The hash I care about is: 8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4
If piping to awk and excluding time, I get:
$ find eclipse-workspace -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4
Be sure you check the printed stderr output for errors from find, as a hash will be produced even if find fails.
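The same trap exists outside the shell: Python's os.walk ignores directory-listing errors by default, so a mistyped path quietly yields the hash of an empty listing. A sketch of a stricter variant follows; strict_dir_hash is a hypothetical name, and each file is read whole for brevity.

```python
import hashlib
import os


def _raise(error):
    raise error


def strict_dir_hash(root):
    """Hash `root`, raising instead of returning a hash when the walk fails."""
    entries = []
    # os.walk swallows listing errors unless onerror is given; make them fatal.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=_raise):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as handle:
                file_hash = hashlib.sha256(handle.read()).hexdigest()
            entries.append(f"{file_hash}  {os.path.relpath(full, root)}")
    # Sort the "<hash>  <path>" lines, as the `| sort` step does.
    return hashlib.sha256("\n".join(sorted(entries)).encode("utf-8")).hexdigest()
```

With this variant a missing or unreadable directory raises an OSError instead of producing a plausible-looking hash.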
Hashing my whole eclipse-workspace dir in only 12 seconds is impressive considering it contains 6480 files, as shown by this:
find eclipse-workspace -type f | wc -l
...and is 3.6 GB in size, as shown by this:
du -sh eclipse-workspace
Other credit: I had a chat with ChatGPT to learn some of the pieces above. All work and text above, however, was written by me, tested by me, and verified by me.
Upvotes: 0
Reputation: 61
You can try hashdir which is an open source command line tool written for this purpose.
hashdir /folder/of/stuff
It has several useful flags to allow you to specify the hashing algorithm, print the hashes of all children, as well as save and verify a hash.
hashdir:
A command-line utility to checksum directories and files.
Usage:
hashdir [options] [<item>...] [command]
Arguments:
<item> Directory or file to hash/check
Options:
-t, --tree Print directory tree
-s, --save Save the checksum to a file
-i, --include-hidden-files Include hidden files
-e, --skip-empty-dir Skip empty directories
-a, --algorithm <md5|sha1|sha256|sha384|sha512> The hash function to use [default: sha1]
--version Show version information
-?, -h, --help Show help and usage information
Commands:
check <item> Verify that the specified hash file is valid.
Upvotes: 3
Reputation: 49
Another tool to achieve this:
http://md5deep.sourceforge.net/
As it sounds: like md5sum but also recursive, plus other features.
md5deep -r {directory}
Upvotes: 4
Reputation: 13793
If this is a git repo and you want to ignore any files in .gitignore, you might want to use this:
git ls-files <your_directory> | xargs sha256sum | cut -d" " -f1 | sha256sum | cut -d" " -f1
This is working well for me.
Upvotes: 8
Reputation: 21258
One possible way would be:
sha1sum path/to/folder/* | sha1sum
If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be
find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
And, finally, if you also need to take account of permissions and empty directories:
(find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum;
find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
xargs -0 stat -c '%n %a') \
| sha1sum
The arguments to stat will cause it to print the name of the file, followed by its octal permissions. The two finds will run one after the other, causing double the amount of disk IO: the first finds all file names and checksums their contents, and the second finds all file and directory names and prints each name and mode. The list of "file names and checksums", followed by "names and directories, with permissions", will then be checksummed, for a smaller checksum.
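As a cross-check of the idea, the third command can be sketched in Python. This is an illustration under assumptions: dir_hash_with_modes is a hypothetical name, paths are taken relative to the folder, and the digest is not meant to equal the shell pipeline's output.

```python
import hashlib
import os
import stat


def dir_hash_with_modes(root):
    """Hash contents, then names and octal modes of files and directories."""
    files, everything = [], ["."]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            files.append(rel)
            everything.append(rel)
        for name in dirnames:
            everything.append(os.path.relpath(os.path.join(dirpath, name), root))
    digest = hashlib.sha1()
    # Part 1: "<content hash>  <name>" for every file.
    for rel in sorted(files):
        with open(os.path.join(root, rel), "rb") as handle:
            content_hash = hashlib.sha1(handle.read()).hexdigest()
        digest.update(f"{content_hash}  {rel}\n".encode("utf-8"))
    # Part 2: "<name> <octal mode>" for files and directories, like stat -c '%n %a'.
    for rel in sorted(everything):
        mode = stat.S_IMODE(os.lstat(os.path.join(root, rel)).st_mode)
        digest.update(f"{rel} {mode:o}\n".encode("utf-8"))
    return digest.hexdigest()
```

Because directories appear in the second part, adding an empty directory or changing a permission bit changes the result even when no file content changed.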
Upvotes: 205
Reputation: 13087
I had to check a whole directory for file changes, but excluding timestamps and directory ownership.
The goal is to get a sum that is identical anywhere if the files are identical, including on other machines, regardless of anything but the files themselves or a change to them.
md5sum * | md5sum | cut -d' ' -f1
It generates a list of hashes, one per file, then hashes that list into a single value.
This is way faster than the tar method.
For stronger privacy in our hashes, we can use sha512sum with the same recipe.
sha512sum * | sha512sum | cut -d' ' -f1
The hashes are also identical anywhere using sha512sum, but there is no known way to reverse it.
Upvotes: 2
Reputation: 1599
If you just want to check if something in the folder changed, I'd recommend this one:
ls -alR --full-time /folder/of/stuff | sha1sum
It will just give you a hash of the ls output, that contains folders, sub-folders, their files, their timestamp, size and permissions. Pretty much everything that you would need to determine if something has changed.
Please note that this command does not hash the contents of each file, which is why it should be faster than the find-based approaches.
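The trade-off can be sketched in Python: hashing only metadata is cheap, but an edit that keeps the size and restores the timestamp slips through. metadata_hash is a hypothetical helper, not an equivalent of the ls output.

```python
import hashlib
import os


def metadata_hash(root):
    """Hash names, modes, sizes and mtimes only; file contents are never read."""
    records = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            info = os.lstat(full)
            rel = os.path.relpath(full, root)
            records.append(
                f"{rel} {info.st_mode:o} {info.st_size} {info.st_mtime_ns}"
            )
    return hashlib.sha1("\n".join(sorted(records)).encode("utf-8")).hexdigest()
```

This is fine for a quick "did anything change?" check on one machine, but timestamps usually differ between a folder and its copy, so it is not suited to verifying copies.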
Upvotes: 28
Reputation: 45081
I would pipe the results for individual files through sort (to prevent a mere reordering of files from changing the hash) into md5sum or sha1sum, whichever you choose.
Upvotes: 1
Reputation: 2361
You could use sha1sum to generate the list of hash values and then sha1sum that list again; it depends on what exactly you want to accomplish.
Upvotes: 0
Reputation: 7639
Try doing it in two steps: first write the hashes of all files in the folder to a file, then hash that file.
Like so:
# for FILE in `find /folder/of/stuff -type f | sort`; do sha1sum $FILE >> hashes; done
# sha1sum hashes
Or do it all at once:
# cat `find /folder/of/stuff -type f | sort` | sha1sum
Upvotes: 1
Reputation:
If you just want to hash the contents of the files, ignoring the filenames then you can use
cat $FILES | md5sum
Make sure you have the files in the same order when computing the hash:
cat $(printf '%s\n' $FILES | sort) | md5sum
But you can't have directories in your list of files.
Upvotes: 4