Greg

Reputation: 151

Caching remote files with Python

Background

We have a lot of data files stored on a network drive which we process in python. For performance reasons I typically copy the files to my local SSD when processing. My wish is to make this happen automatically, so whenever I try to open a file it will grab the remote version if it isn't stored locally, and ideally also keep some sort of timer to delete the files after some time. The files will practically never be changed so I do not require actual syncing capabilities.

Functionality

To sum up, I am looking for functionality that will:

- transparently fetch a remote file to my local SSD the first time it is opened, and use the local copy on subsequent accesses;
- delete the local copies again after some time (the files will practically never change, so no real syncing is needed).

It wouldn't be too difficult for me to write a piece of code which does this myself, but when possible, I prefer to rely on existing projects, as this typically gives a more versatile end result and also makes any of my own improvements easily available to other users.
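For reference, the "write it myself" route could look roughly like the sketch below, using only the standard library. All names here (cached_open, purge_stale, CACHE_DIR) are made up for illustration; it is a minimal sketch of the desired behaviour, not a finished tool:

```python
import shutil
import time
from pathlib import Path

CACHE_DIR = Path("local_cache")   # directory on the local SSD (assumption)
MAX_AGE = 7 * 24 * 3600           # drop cached copies after one week

def cached_open(remote_path, mode="rb"):
    """Open remote_path, copying it into CACHE_DIR first if not cached."""
    CACHE_DIR.mkdir(exist_ok=True)
    local = CACHE_DIR / Path(remote_path).name
    if not local.exists():
        shutil.copy2(remote_path, local)  # fetch from the network drive
    return open(local, mode)

def purge_stale(max_age=MAX_AGE):
    """Delete cached files that have not been touched for max_age seconds."""
    if CACHE_DIR.exists():
        now = time.time()
        for f in CACHE_DIR.iterdir():
            if now - f.stat().st_mtime > max_age:
                f.unlink()
```

Note this keys the cache on the file name only; real code would want to mirror the remote directory structure to avoid collisions.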

Question

I have searched around for terms like "python local file cache", "file synchronization" and the like, but what I have found mostly handles caching of function return values. I was a bit surprised, because I would imagine this is a quite general problem. My question is therefore: is there something I have overlooked, and more importantly, are there any technical terms describing this functionality which could help my research?

Thank you in advance, Gregers Poulsen

-- Update --

Due to other proprietary software packages, I am forced to use Windows, so the solution must naturally support this.

Upvotes: 4

Views: 1442

Answers (1)

gerrit

Reputation: 26535

Have a look at fsspec remote caching, with a tutorial on the Anaconda blog and the official documentation. Quoting the former:

In this article, we will present [fsspec]'s new ability to cache remote content, keeping a local copy for faster lookup after the initial read.

They give an example for how to use it:

import fsspec
of = fsspec.open("filecache://anaconda-public-datasets/iris/iris.csv", mode='rt', 
                 cache_storage='/tmp/cache1',
                 target_protocol='s3', target_options={'anon': True})
with of as f:
    print(f.readline())

On the first call, the file will be downloaded, stored in the cache, and its contents returned. On subsequent calls, it will be read from the local copy instead.

I haven't used it yet, but I need it and it looks promising.
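Since the question also asks for timed deletion of the local copies, it may be worth noting that fsspec's caching filesystem accepts an expiry_time parameter (in seconds), after which a cached copy is considered stale and re-fetched. A minimal local-only sketch, assuming fsspec is installed; the path here is made up for illustration, and target_protocol "file" is used only so the example is self-contained (a network-drive path would go in its place):

```python
import fsspec

# Open a file through the whole-file cache, re-fetching copies that are
# older than expiry_time seconds. The path below does not exist; it only
# illustrates the call shape.
of = fsspec.open(
    "filecache:///path/to/remote/data.csv",  # made-up path
    mode="rt",
    cache_storage="/tmp/cache1",
    target_protocol="file",
    expiry_time=7 * 24 * 3600,  # keep local copies for one week
)
```

fsspec.open is lazy, so nothing is fetched until the returned object is used as a context manager, exactly as in the answer's example above.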

Upvotes: 2
