Reputation: 514
I am currently working with the s3a adapter of Hadoop/HDFS to upload a number of files from a Hive database to a particular S3 bucket. I'm getting nervous because I can't find anything online about specifying a bunch of filepaths (not directories) to copy via distcp.
I have set up my program to collect an array of filepaths using a function, inject them all into a distcp command, and then run the command:
files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

full_path_files = [f"hdfs://nameservice1{file}" for file in files]
s3_dest = "path/to/bucket"
cmd = f"hadoop distcp -update {' '.join(full_path_files)} s3a://{s3_dest}"
logger.info(f"Preparing to upload Hive data files with cmd: \n{cmd}")
result = subprocess.run(cmd, shell=True, check=True)
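For illustration (these paths are made up), the assembled cmd ends up as one long command along these lines:
hadoop distcp -update hdfs://nameservice1/warehouse/db/table/file1.orc hdfs://nameservice1/warehouse/db/table/file2.orc ... s3a://path/to/bucket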
This basically just creates one long distcp command with 15-20 different filepaths. Will this work? Should I be using the -cp or -put commands instead of distcp?
(It doesn't make sense to me to copy all these files to their own directory and then distcp that entire directory, when I can just copy them directly and skip those steps...)
Upvotes: 0
Views: 820
Reputation: 191743
-cp and -put would require you to download the HDFS files, then upload to S3. That would be a lot slower.
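To make that concrete, here is a minimal sketch of that two-step route (the paths and staging directory are hypothetical); every byte has to pass through the local machine's disk before reaching S3:
import subprocess

local_staging = "/tmp/hive_upload"  # hypothetical local staging directory

# Pull the file out of HDFS onto the local disk...
subprocess.run(["hadoop", "fs", "-get", "hdfs://nameservice1/path/to/file", local_staging], check=True)

# ...then push it back up to S3 from the local disk.
subprocess.run(["hadoop", "fs", "-put", f"{local_staging}/file", "s3a://path/to/bucket/"], check=True)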
I see no immediate reason why this wouldn't work; however, reading over the documentation, I would recommend using the -f flag instead.
E.g.
files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

# Write one fully-qualified source URI per line to a listing file.
src_file = 'to_copy.txt'
with open(src_file, 'w') as f:
    for file in files:
        f.write(f'hdfs://nameservice1{file}\n')

s3_dest = "path/to/bucket"
# Pass the arguments as a list without shell=True (shell=True combined with
# a list would drop everything after 'hadoop').
result = subprocess.run(
    ['hadoop', 'distcp', '-f', src_file, f's3a://{s3_dest}'],
    check=True,
)
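With the hypothetical bucket path above, to_copy.txt ends up holding one fully-qualified hdfs:// URI per line, and the invocation reduces to:
hadoop distcp -f to_copy.txt s3a://path/to/bucket
One assumption worth checking: depending on the cluster configuration, distcp may resolve a bare listing filename against the default filesystem (HDFS) rather than the local disk; if so, pass a fully-qualified URI such as file:///absolute/path/to/to_copy.txt, or stage the listing file on HDFS first.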
If all the files were already in their own directory, then you should just copy the directory, like you said.
Upvotes: 1