Reputation: 48
I use gsutil in a Linux environment for managing files in GCS. I enjoy being able to use the command
gsutil -m cp -I gs://...
preceded by some other command that pipes a list of files to gsutil on STDIN; that way I can maintain a local list of files that have been uploaded, or generate specific patterns to upload, and hand them off.
I would like to be able to run a similar command, such as
gsutil -m rm -I gs://...
to scrub files similarly. Presently, I build a big list of files to remove and run it with the following code:
while read line
do
    gsutil rm gs://...
done < "$myfile.txt"
This is extraordinarily slow compared to the multithreaded "gsutil -m rm..." command, and enabling the -m flag has no effect when you have to process files one at a time from a list. I also experimented with just running
gsutil -m rm gs://.../* # remove everything
<my command> | gsutil -m cp -I gs://.../ # put back the pieces that I want
but this involves recopying a lot of data and wastes a lot of time; the data is already there and just needs to have some of it removed. Any thoughts would be appreciated. Also, I don't have a lot of flexibility on either end with renaming files; otherwise, a quick rename before uploading would handle all of this.
Upvotes: 1
Views: 1179
Reputation: 48
For anyone wondering, I wound up doing what Zach Wilt suggested in his answer. For reference, I was removing on the order of a couple thousand files from each of 5 directories, so roughly 10,000 files. Doing this without the "-m" switch was taking upwards of 30 minutes; with the "-m" switch, it takes less than 30 seconds. Zoom!
For a robust example: I am using this to update Google Cloud Storage files to match local files. On the current day, I have a program that dumps lots of files that are incremental, and also a handful that are "rolled up". After a week, the incremental files get scrubbed locally automatically, but the same should happen in GCS to save the space. Here's how to do this:
#!/bin/bash
# get the full date strings for touch
start=$(date --date='-9 days' +%x)
end=$(date --date='-8 days' +%x)
# other vars
mon=$(date --date='-9 days' +%b | tr 'A-Z' 'a-z')
day=$(date --date='-9 days' +%d)
# display start and finish times
echo "Cleaning files from $start"
# update start and finish times
touch --date="$start" /tmp/start1
touch --date="$end" /tmp/end1
# repeat for all servers
for dr in "dir1" "dir2" "dir3" ...
do
    # list files in range and build retention file
    find "/local/path/$dr/" -newer /tmp/start1 ! -newer /tmp/end1 > "$dr-local.txt"
    # get list of all files from appropriate folder on GCS
    gsutil ls "gs://gcs_path/$mon/$dr/$day/" > "$dr-gcs.txt"
    # formatting the host list file
    sed -i "s|gs://gcs_path/$mon/$dr/$day/|/local/path/$dr/|" "$dr-gcs.txt"
    # build sed command file to delete matches (start from an empty file)
    : > "$dr-del.txt"
    while IFS= read -r line
    do
        echo "\|$line|d" >> "$dr-del.txt"
    done < "$dr-local.txt"
    # run command file to strip lines for files that need to remain
    sed -f "$dr-del.txt" < "$dr-gcs.txt" > "$dr-out.txt"
    # convert local names to GCS names
    sed -i "s|/local/path/$dr/|gs://gcs_path/$mon/$dr/$day/|" "$dr-out.txt"
    # read the newline-separated file into an array, one object per element
    mapfile -t del < "$dr-out.txt"
    # remove all files matching the final output (skip if nothing is left to remove)
    if [ "${#del[@]}" -gt 0 ]
    then
        gsutil -m rm "${del[@]}"
    fi
    # cleanup files
    rm "$dr-local.txt" "$dr-gcs.txt" "$dr-del.txt" "$dr-out.txt"
done
You'll need to modify this to fit your needs, but it is a concrete, working method for deleting files locally and then synchronizing the change to Google Cloud Storage. Thanks again to @Zach Wilt.
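One possible refinement, offered as a sketch rather than as part of the script I actually ran: the sed command file treats every local path as a regular expression and matches it anywhere in the line, so a kept file named a.log would also strip a.log.old from the removal list. Assuming GNU grep is available, the same "in the GCS listing but not in the local listing" step can be done with fixed-string, whole-line matching. The function name lines_not_in is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not from the original script): print every line of
# the second file that does not appear, as an exact whole line, in the first.
lines_not_in() {
    local keep_file=$1 all_file=$2
    # -F: patterns are fixed strings, not regular expressions
    # -x: a pattern must match the whole line
    # -v: print the lines that do NOT match
    # -f: read the patterns from keep_file, one per line
    # grep exits with 1 when it prints nothing; treat that as success.
    grep -vxFf "$keep_file" "$all_file" || [ $? -eq 1 ]
}

# Usage inside the loop, replacing the sed command-file step:
#   lines_not_in "$dr-local.txt" "$dr-gcs.txt" > "$dr-out.txt"
```

With this in place the "$dr-del.txt" file is no longer needed, so it would also have to come out of the cleanup step.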
Upvotes: 1
Reputation: 446
As an interim solution, since we don't have a -I option for rm right now, how about just creating a string of all the objects you want to delete in your loop and then using gsutil -m rm to delete it? You could also do this with a simple python script that invokes the gsutil command from within python as a separate process.
Expanding on your earlier example, maybe something like the following (disclaimer: my bash-fu isn't the greatest, and I haven't tested this):
objects=''
while IFS= read -r line
do
    objects="$objects gs://$line"
done < "$myfile.txt"
gsutil -m rm $objects
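A variation on the same idea, as a sketch only: with a very long list, one giant string can run into the system's command-line length limit. Assuming GNU xargs and a list file that already holds one full gs:// URL per line, xargs can build the argument list and split it into several gsutil -m rm invocations if needed. The names remove_listed and GSUTIL_CMD are made up for illustration; GSUTIL_CMD exists only so the command can be swapped for echo as a dry run.

```shell
#!/bin/bash
# Hypothetical helper: remove every object named in a list file, in batches.
# GSUTIL_CMD is an assumption added here so a dry run is possible.
GSUTIL_CMD=${GSUTIL_CMD:-gsutil}

remove_listed() {
    local list_file=$1
    # Drop blank lines, then let xargs append as many names as fit on one
    # command line, starting another batch when the limit is reached.
    # -d '\n' (GNU xargs) makes each whole line a single argument.
    # -r      (GNU xargs) skips the command entirely if the list is empty.
    grep -v '^[[:space:]]*$' "$list_file" |
        xargs -d '\n' -r "$GSUTIL_CMD" -m rm
}

# Dry run that prints the command instead of deleting anything:
#   GSUTIL_CMD=echo
#   remove_listed "$myfile.txt"
```

Each batch still gets the benefit of -m. It may also be worth checking gsutil help rm on your installed version, since later gsutil releases document a -I option for rm that reads the object list from stdin directly.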
Upvotes: 3