Reputation: 91
In our data lake (Hadoop/MapR/Red Hat) we have a directory which contains more than 40 million files. We can't run an ls command on it.
I've tried launching the hadoop getmerge command to merge the files, but I get no output.
hadoop fs -rm doesn't work either.
Is there another way to view the contents of this folder? How could I purge old files from it without a full scan?
Thank you
Upvotes: 4
Views: 1735
Reputation: 5947
A couple of things. If you have access to the namenode or secondary namenode, you can use hdfs oiv (the offline image viewer) to dump the HDFS fsimage to an offline delimited file, then find the paths you're looking for there.
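For example, a minimal sketch; the fsimage checkpoint file name and output path are placeholders you would adjust to your namenode's metadata directory:
$ hdfs oiv -p Delimited -delimiter ',' -i /path/to/fsimage_0000000000000000000 -o /tmp/fsimage.csv
$ grep '^/PATH-OF-MILLION-FILES-DIR' /tmp/fsimage.csv | head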
Hadoop has an existing file format called .har, which stands for Hadoop archive. If you want to preserve your files, you should look into using that instead of getmerge.
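A minimal sketch of creating an archive, assuming hypothetical parent/source/destination paths (note that hadoop archive runs a MapReduce job, so expect it to take a while on 40 million files):
$ hadoop archive -archiveName small-files.har -p /parent-dir million-files-dir /archive-target
$ hdfs dfs -ls har:///archive-target/small-files.har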
You can use distcp to delete directories: create an empty HDFS directory in /tmp, then copy the empty directory over your directory with 40M files using distcp, so the removal is done with more mappers.
$ hdfs dfs -mkdir /tmp/empty_dir
$ hadoop distcp -m 20 -update -delete /tmp/empty_dir /PATH-OF-MILLION-FILES-DIR
The fs -rm -r command is single-threaded, while distcp runs in parallel based on the number of mappers, which is why you should use the -delete option with distcp.
Upvotes: 1
Reputation: 191728
fs -rm will move the files to the HDFS trash, so you're not actually deleting any records, just moving them.
You need to add -skipTrash for a purge to actually happen, and I would recommend that you purge in batches.
For example, remove all files starting with the letter a:
hdfs dfs -rm -R -skipTrash /path/data/a*
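If you want to script the batches, a minimal sketch using bash brace expansion, assuming the file names start with a lowercase letter or a digit:
$ for p in {a..z} {0..9}; do hdfs dfs -rm -R -skipTrash "/path/data/${p}*"; done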
getmerge downloads all records to your local machine, so you had better be sure you have enough disk space.
The only way to merge within HDFS is a MapReduce or Spark task.
It depends on your file formats, but FileCrush is a library you could look into. However, keep in mind that if you want to merge anything, you need at least 120% extra capacity on HDFS to duplicate the data, plus overhead for temporary files.
Upvotes: 0