Matthias Langer

Reputation: 1024

Putting many small files into HDFS to train/evaluate a model

I want to extract the contents of some large tar.gz archives, which contain millions of small files, to HDFS. After the data has been uploaded, it should be possible to access individual files in the archive by their paths, and to list them. The most straightforward solution would be to write a small script that extracts these archives to some HDFS base folder. However, since HDFS is known not to deal particularly well with small files, I'm wondering how this solution can be improved. These are the potential approaches I have found so far:

Ideally, I want the solution to play well with Spark, meaning that accessing the data with Spark should not be more complicated than it would be if the data were extracted to HDFS directly. What are your suggestions and experiences in this domain?
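For reference, the "straightforward solution" I mention above could look roughly like the sketch below: it streams a tar.gz archive and writes one HDFS file per entry. The class name, paths, and the use of Apache Commons Compress for tar handling are my own assumptions, not something I have in production.

    // Sketch: extract a tar.gz into an HDFS base folder, one HDFS file per entry.
    // Assumes commons-compress and the Hadoop client are on the classpath.
    import java.io.BufferedInputStream
    import java.nio.file.{Files, Paths}
    import java.util.zip.GZIPInputStream
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils

    object ExtractToHdfs {
      def main(args: Array[String]): Unit = {
        val Array(archive, hdfsBase) = args        // e.g. data.tar.gz hdfs:///extracted
        val fs = FileSystem.get(new Configuration())

        val tar = new TarArchiveInputStream(
          new GZIPInputStream(new BufferedInputStream(Files.newInputStream(Paths.get(archive)))))

        var entry = tar.getNextTarEntry
        while (entry != null) {
          if (!entry.isDirectory) {
            // One HDFS file per archive entry -- exactly the layout that
            // creates the small-file problem discussed in this question.
            val out = fs.create(new Path(hdfsBase, entry.getName))
            IOUtils.copyBytes(tar, out, 4096, false)   // copy entry bytes; keep the tar stream open
            out.close()
          }
          entry = tar.getNextTarEntry
        }
        tar.close()
        fs.close()
      }
    }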


Upvotes: 0

Views: 101

Answers (2)

Sergei Yendiyarov

Reputation: 535

Sequence files are a great way to handle Hadoop's small-files problem.
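A hedged Spark sketch of that idea (not from the answer itself; the paths, the use of wholeTextFiles, and the example file name are assumptions): pack the extracted files into a SequenceFile keyed by their original path, then look individual files up by key.

    import org.apache.spark.sql.SparkSession

    object PackIntoSequenceFile {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("pack-small-files").getOrCreate()
        val sc = spark.sparkContext

        val stagingPath = "hdfs:///staging/extracted"  // assumed: where the archive was extracted
        val packedPath  = "hdfs:///packed/files.seq"   // assumed: output SequenceFile location

        // (path, content) pairs, one per small file
        val files = sc.wholeTextFiles(stagingPath)

        // Write a few large SequenceFile parts instead of millions of tiny files
        files.saveAsSequenceFile(packedPath)

        // Individual files can later be retrieved by filtering on the key (the original path)
        val restored = sc.sequenceFile[String, String](packedPath)
        restored.filter { case (path, _) => path.endsWith("some-file.txt") }  // assumed file name
                .collect()
                .foreach { case (path, content) => println(s"$path: ${content.length} chars") }

        spark.stop()
      }
    }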

Upvotes: 0

Matt Andruff

Reputation: 5135

You can land the files in a landing zone and then process them into something more useful:

zcat <infile> | hdfs dfs -put - /LandingData/<infile>.tar

Then build a table on top of that 'landed' data, using Hive or Spark.

Then write out a new table (in a new folder) in Parquet or ORC format.

Whenever you need to run analytics on the data, use this new table; it will perform well and avoid the small-file problem. This confines the small-file problem to a one-time load.
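A hedged Spark sketch of that flow (the paths, the assumption that the landed data is line-delimited JSON, and the output file count are mine, not part of this answer; adjust the reader to whatever format you actually land):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object CompactLandedData {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("compact-landed-data").getOrCreate()

        // Build a (temporary) table on top of the landed data
        val landed = spark.read.json("hdfs:///LandingData/")
        landed.createOrReplaceTempView("landed_data")

        // Write a new table, in a new folder, as Parquet with fewer, larger files
        spark.table("landed_data")
          .coalesce(32)                              // assumption: ~32 output files is enough
          .write
          .mode(SaveMode.Overwrite)
          .parquet("hdfs:///warehouse/compacted/")

        // Analytics should read the compacted copy, not the landing zone
        val compacted = spark.read.parquet("hdfs:///warehouse/compacted/")
        compacted.createOrReplaceTempView("compacted_data")
        spark.sql("SELECT COUNT(*) FROM compacted_data").show()

        spark.stop()
      }
    }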

Upvotes: 0
