LeonardBlunderbuss

Reputation: 1274

Does DistributedCache remove cached files after every job?

The documentation for DistributedCache states:

Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.

What does it mean when it says it can "cache archives which are un-archived on the slaves"? Are cached files removed after every job? I would like to run the same job hundreds of times on different data sets without the overhead of re-distributing the DistributedCache files before every single job. Is this possible?

Upvotes: 4

Views: 463

Answers (1)

rVr

Reputation: 1331

Hadoop keeps a reference count of how many tasks are using each file in the DistributedCache. When the count drops to 0, the file is marked for deletion. So at the end of the job the files in the DistributedCache are cleaned up; otherwise they would keep piling up on the nodes across jobs.
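
To make the reference-counting idea concrete, here is a minimal sketch in plain Java. It is not Hadoop's actual implementation: the class `RefCountedCacheSketch`, its methods, and the file name `lookup.dat` are all hypothetical, and "marked for deletion" is modelled as membership in a set rather than any real cleanup on disk.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of reference counting; NOT Hadoop's internal code.
class RefCountedCacheSketch {
    // Number of tasks currently using each cached file.
    private final Map<String, Integer> refCounts = new HashMap<>();
    // Files whose count has dropped to 0 and that may be cleaned up.
    private final Set<String> markedForDeletion = new HashSet<>();

    // A task starts using a cached file: increment its count.
    void acquire(String file) {
        refCounts.merge(file, 1, Integer::sum);
        markedForDeletion.remove(file);
    }

    // A task is done with a cached file: decrement its count,
    // and mark the file for deletion once nobody uses it.
    void release(String file) {
        Integer count = refCounts.get(file);
        if (count == null) {
            throw new IllegalStateException("release without acquire: " + file);
        }
        if (count == 1) {
            refCounts.remove(file);
            markedForDeletion.add(file);
        } else {
            refCounts.put(file, count - 1);
        }
    }

    boolean isMarkedForDeletion(String file) {
        return markedForDeletion.contains(file);
    }

    public static void main(String[] args) {
        RefCountedCacheSketch cache = new RefCountedCacheSketch();
        cache.acquire("lookup.dat"); // first task
        cache.acquire("lookup.dat"); // second task
        cache.release("lookup.dat");
        System.out.println(cache.isMarkedForDeletion("lookup.dat")); // false
        cache.release("lookup.dat");
        System.out.println(cache.isMarkedForDeletion("lookup.dat")); // true
    }
}
```

In this sketch a file only becomes eligible for cleanup after the last task using it has released it, which is the behaviour described above.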

Upvotes: 2
