Ayman Anikad

Reputation: 91

How could I purge or merge millions of files in HDFS?

In our data lake (Hadoop/MapR/Red Hat) we have a directory that contains more than 40 million files. We can't even run an ls command on it.

I've tried running the hadoop getmerge command to merge the files, but I get no output.

hadoop fs -rm doesn't work either.

Is there another way to view the contents of this folder? How could I purge old files from it without a full scan?

Thank you

Upvotes: 4

Views: 1735

Answers (2)

tk421

Reputation: 5947

A couple of things. If you have access to the NameNode or the secondary NameNode, you can use hdfs oiv (the Offline Image Viewer) to dump the HDFS fsimage to a delimited text file, then find the paths you're looking for there.
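A rough sketch of that, assuming a standard HDFS NameNode and a Hadoop version whose oiv includes the Delimited processor (the fsimage file name below is a placeholder for whatever -fetchImage actually downloads):

$ hdfs dfsadmin -fetchImage /tmp                      # grab the latest fsimage from the NameNode (needs admin rights)
$ hdfs oiv -p Delimited -i /tmp/fsimage_0000000000012345678 -o /tmp/fsimage.tsv
$ grep '^/PATH-OF-MILLION-FILES-DIR/' /tmp/fsimage.tsv | head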

Hadoop has an existing file format called .har, which stands for Hadoop ARchive. If you want to preserve your files, you should look into using that instead of getmerge.
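A rough sketch, using hypothetical paths (the archive is built by a MapReduce job, and the source directory is left untouched):

$ hadoop archive -archiveName data.har -p /PATH-OF-MILLION-FILES-DIR /archives
$ hdfs dfs -ls har:///archives/data.har               # browse the archived files through the har:// scheme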

You can use distcp to delete directories.

You can create an empty HDFS directory in /tmp, then distcp that empty directory over your directory with 40M files and do the remove with more mappers.

$ hdfs dfs -mkdir /tmp/empty_dir
$ hadoop distcp -m 20 -update -delete /tmp/empty_dir /PATH-OF-MILLION-FILES-DIR

The fs -rm -r is single-threaded, while distcp runs in parallel based on the number of mappers. That's why you use the -delete option with distcp (note that -delete only takes effect together with -update or -overwrite, hence the -update flag above).


Upvotes: 1

OneCricketeer

Reputation: 191728

fs -rm will move the files to HDFS trash, so you're not actually deleting anything, just moving it.

You need to add -skipTrash for a purge to actually happen, and I would recommend that you purge in batches.

For example, to remove all files starting with the letter a:

hdfs dfs -rm -R -skipTrash /path/data/a*
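If you want to script the batches, here is a minimal sketch assuming the file names start with lowercase letters or digits (adjust the prefixes to your naming scheme):

for prefix in {a..z} {0..9}; do
  # quotes keep the local shell from expanding the glob; the HDFS shell expands it instead
  hdfs dfs -rm -R -skipTrash "/path/data/${prefix}*"
done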

getmerge downloads all files to your local machine, so you'd better be sure you have enough disk space.
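For example, a minimal sketch with hypothetical paths; compare the HDFS size against your free local space before running it:

hdfs dfs -du -s -h /path/data                 # total size of the data on HDFS
df -h /local/target                           # free space on the local destination
hdfs dfs -getmerge /path/data /local/target/merged.out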

The only way to merge within HDFS is a MapReduce or Spark task.

It depends on your file formats, but FileCrush is a library you could look into. However, keep in mind that if you want to merge anything, you need at least 120% extra capacity on HDFS to duplicate the data, plus overhead for temporary files.

Upvotes: 0
