Reputation: 9576

bash scripting de-dupe

I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.

This file doesn't change very frequently though, so I want to discard the downloaded file if it already exists.

Easiest way to do this?

Thanks!

Upvotes: 5

Answers (4)

c00kiemon5ter

Reputation: 17674

Do you really need to compress the file?
wget provides -N (--timestamping), which turns on time-stamping: wget downloads the file only if the copy on the server is newer than your local one. Say your file is located at www.example.com/file.txt

The first time you do:

$ wget -N www.example.com/file.txt
[...]
[...] file.txt saved [..size..]

The next time it'll be like this:

$ wget -N www.example.com/file.txt
Server file no newer than local file “file.txt” -- not retrieving.

Unless the file on the server has been updated, in which case it is downloaded again.

That would solve your problem, if you didn't compress the file.
If you really need to compress it, then I guess I'd compare the hash of the new file/archive with the old one. What matters in that case: How big is the downloaded file? Is it worth compressing it first and then checking the hashes? Is it worth decompressing the old archive and comparing the hashes? Is it better to store the old hash in a text file? Do any of these have an advantage over simply overwriting the old file?

Only you can know that, so run some tests.

So if you go the hash way, consider sha256 and xz (lzma2 algorithm) compression.
I would do something like this (in Bash):

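# download, keep a copy in file.txt, and hash the content on the fly
newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
# hash the content of the previous archive (hash of empty input on the first run)
oldfilesum="$(xzcat file.txt.xz | sha256sum)"
# quote the right-hand side so it is compared as a string, not as a pattern
if [[ $newfilesum != "$oldfilesum" ]]; then
    xz -f file.txt # overwrite with the new compressed data
else
    rm file.txt    # unchanged: discard the download
fi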

And that's it.

Upvotes: 5

Diego Sevilla

Reputation: 29021

Calculate a hash of the content of the file and check against the new one. Use for instance md5sum. You only have to save the last MD5 sum to check if the file changed.
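A minimal sketch of that idea in Bash, assuming GNU md5sum is available. The function name keep_if_changed and the file names in the usage comment are made up for illustration; the download step is left out so the comparison logic stands on its own.

```shell
#!/usr/bin/env bash
# keep_if_changed FILE SUMFILE   (hypothetical helper, not from the question)
# Returns 0 and records the new MD5 in SUMFILE when FILE differs from the
# last recorded download; returns 1 and deletes FILE when it is unchanged.
keep_if_changed() {
    local file=$1 sumfile=$2 new old=""
    # Hash via stdin so the output does not contain the file name.
    new=$(md5sum < "$file") || return 2
    if [[ -f $sumfile ]]; then
        old=$(< "$sumfile")
    fi
    if [[ $new == "$old" ]]; then
        rm -f -- "$file"      # same content as last time: discard
        return 1
    fi
    printf '%s\n' "$new" > "$sumfile"   # remember the hash for the next run
}

# Usage (URL and file names are placeholders):
#   wget -q -O file.txt www.example.com/file.txt &&
#       keep_if_changed file.txt file.txt.md5 &&
#       gzip -c file.txt > "file-$(date +%F).txt.gz"
```

Only the one-line hash file has to survive between runs, so the old archives never need to be decompressed for the comparison.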

Also, take into account that the web is evolving to give more information on pages, that is, metadata. A well-designed web site should include the file version and/or date of modification (or a valid Expires header) as part of the response headers. This, and quite a few other things, is what makes up the scalability of Web 2.0.

Upvotes: 1

David W.

Reputation: 107090

You can compare the new file with the last one using the sum command. This takes the checksum of the file. If both files have the same checksum, they are very, very likely to be exactly the same. There's another command called md5 (md5sum on Linux) that takes the MD5 fingerprint, but the sum command is on all systems.
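As a rough sketch, the comparison could look like this in Bash. The helper name same_sum and the file names in the usage comment are placeholders, and both files are assumed to be available uncompressed.

```shell
#!/usr/bin/env bash
# same_sum OLD NEW   (hypothetical helper)
# Succeeds when both files exist and `sum` reports the same checksum and
# block count for them; returns 2 if either file is missing.
same_sum() {
    [[ -f $1 && -f $2 ]] || return 2
    # Reading from stdin keeps the file name out of sum's output.
    [[ "$(sum < "$1")" == "$(sum < "$2")" ]]
}

# Usage (file names are placeholders):
#   if same_sum previous.txt download.txt; then rm download.txt; fi
```

When both files are on disk anyway, cmp -s old new performs an exact byte-for-byte comparison and avoids relying on a checksum at all.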

Upvotes: 0

Ryan Leonard

Reputation: 997

How about downloading the file, and checking it against a "last saved" file?

For example, the first time it downloads myfile, saves it as myfile-[date], and compresses it. It also adds a symbolic link, such as lastfile, pointing to myfile-[date]. The next time the script runs, it can check whether the contents of whatever lastfile points to are the same as the newly downloaded file.

Don't know if this would work well, but it's what I could think of.
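One way this could be sketched in Bash, assuming gzip as the compressor (the question does not say which one is used). The function name archive_if_new and the directory layout are invented for illustration, and the download itself is left to the caller.

```shell
#!/usr/bin/env bash
# archive_if_new NEW DIR   (hypothetical helper)
# NEW is the freshly downloaded file. DIR holds dated gzip archives plus a
# "lastfile" symlink pointing at the most recent one.
# Returns 0 when a new archive was stored, 1 when NEW was a duplicate.
archive_if_new() {
    local new=$1 dir=$2 target
    # Decompress the last archive to stdout and compare it byte for byte.
    if [[ -e $dir/lastfile ]] && gzip -dc < "$dir/lastfile" | cmp -s - "$new"; then
        rm -f -- "$new"       # same content as the last archive: discard
        return 1
    fi
    target="myfile-$(date +%Y-%m-%d).gz"
    gzip -c < "$new" > "$dir/$target" || return 2
    ln -sfn -- "$target" "$dir/lastfile"   # relative link inside DIR
    rm -f -- "$new"
}

# Usage (URL and paths are placeholders):
#   wget -q -O /tmp/myfile www.example.com/file.txt &&
#       archive_if_new /tmp/myfile /var/backups/myfile
```

Because the archive name only contains the date, a second changed download on the same day would overwrite that day's archive; with a once-a-day cron job this should not come up.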

Upvotes: 0
