Reputation: 1791
I have data spread across several computers, stored in folders. Many of the folders contain 40-100 GB of files, ranging in size from 500 KB to 125 MB. There are some 4 TB of files which I need to archive, and I need to build a unified metadata system based on the metadata stored on each computer.
All systems run Linux, and we want to use Python. What is the best way to copy the files and archive them?
We already have programs to analyze the files and fill the metadata tables, and they all run in Python. What we need to figure out is a way to copy the files without data loss, and to verify that the files have been copied successfully.
We have considered using rsync and unison, run via subprocess.Popen, but they are essentially sync utilities. Our task is essentially copy once, but copy properly. Once the files are copied, the users would move to the new storage system.
My worries are: 1) the files must not be corrupted when they are copied; 2) the file copying must be efficient, though there are no speed expectations. The LAN is 10/100, with the ports being Gigabit.
Are there any scripts which can be incorporated, or any suggestions? All computers will have SSH keys set up (with ssh-keygen), so we can make passwordless connections.
The directory structure would be maintained on the new server, and it is very similar to that of the old computers.
Upvotes: 0
Views: 990
Reputation: 3188
If a more seamless python integration is the goal you can look at,
Upvotes: 1
Reputation: 5457
I think rsync is the solution. If you are concerned about data integrity, look at the explanation of the "--checksum" parameter in the man page.
Other arguments that might come in handy are "--delete" and "--archive". Make sure the exit code of the command is checked properly.
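As a rough sketch of how this could be driven from Python, the snippet below builds an rsync command line and checks its exit code. The helper names `build_rsync_command` and `run_rsync` are made up for illustration, and the source and destination paths are placeholders. "--delete" is off by default here on the assumption that a one-time copy does not need to remove anything on the destination.

```python
import subprocess


def build_rsync_command(src, dest, checksum=True, delete=False):
    """Return the rsync argument list (hypothetical helper, not part of rsync)."""
    cmd = ["rsync", "--archive"]
    if checksum:
        # Compare files by checksum instead of size and modification time.
        cmd.append("--checksum")
    if delete:
        # Removes files on the destination that are absent from the source.
        cmd.append("--delete")
    # Use ssh as the remote shell; passwordless keys are assumed to be set up.
    cmd += ["-e", "ssh", src, dest]
    return cmd


def run_rsync(src, dest, **options):
    """Run rsync and raise if it reports a non-zero exit code."""
    cmd = build_rsync_command(src, dest, **options)
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            "rsync failed with exit code %d: %s" % (result.returncode, result.stderr)
        )
    return result
```

A call might look like `run_rsync("/data/projects/", "user@newserver:/archive/projects/")`, where both paths are invented examples.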
Upvotes: 0
Reputation: 1174
I would look at the Python Fabric library, which streamlines the use of SSH. If you are concerned about data integrity, I would use SHA1 or some other hash algorithm to create a fingerprint of each file before transfer, then compare the fingerprints generated at the source and at the final destination. All of this could be done using Fabric.
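The fingerprint comparison could be sketched roughly as follows, using only the standard library. The function names are made up for illustration, and the sketch assumes the same script is run on each machine to produce a mapping of relative path to digest; how those mappings are collected over SSH (Fabric or otherwise) is left out.

```python
import hashlib
import os


def file_sha1(path, chunk_size=1024 * 1024):
    """Return the SHA1 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_tree(root):
    """Map each file's path relative to root to its SHA1 digest."""
    fingerprints = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            fingerprints[os.path.relpath(full_path, root)] = file_sha1(full_path)
    return fingerprints


def find_mismatches(source, destination):
    """Return source paths that are missing or have a different digest at the destination."""
    return sorted(
        path for path, digest in source.items() if destination.get(path) != digest
    )
```

An empty list from `find_mismatches` means every source file has a destination copy with the same digest. Files that exist only on the destination are not reported by this sketch.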
Upvotes: 1