Reputation: 382
I have folders containing many files (e.g. over 100k); some files are small (less than 1 KB) and some are large (e.g. several MB).
I would like to use PySpark to scan all the files under these folders, e.g. "C:\Xiang". For example, fold1 contains:
C:\Xiang\fold1\filename1.txt
C:\Xiang\fold1\filename2.txt
C:\Xiang\fold1\filename3.txt
C:\Xiang\fold1\filename1_.meta.txt
C:\Xiang\fold1\filename2_.meta.txt
...
"fold2", "fold3", ... have similarly structure.
I would like to scan all the files under these folders and get the modification time of each file. Ideally, the result would be saved in an RDD of (key, value) pairs, with the key being the filename (e.g. C:\Xiang\filename1.txt) and the value the modification time (e.g. 2020-12-16 13:40). That would let me perform further operations on these files, e.g. filter by modification time and open the selected files.
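To make it concrete, the plain-Python loop below is roughly what I have in mind (os.walk and the cutoff date are just for illustration), except that I would like the (filename, modification time) pairs to end up in an RDD:
import os
from datetime import datetime

pairs = []
for root, dirs, files in os.walk(r"C:\Xiang"):
    for name in files:
        path = os.path.join(root, name)
        pairs.append((path, datetime.fromtimestamp(os.path.getmtime(path))))

# further operations, e.g. keep only files modified after a cutoff
cutoff = datetime(2020, 12, 16, 13, 40)
selected = [(p, t) for p, t in pairs if t >= cutoff]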
Any idea?
Upvotes: 0
Views: 194
Reputation: 42352
Use pathlib to get the last modified time and map it onto your RDD of file names:
import os
import pathlib

base = r"C:\Xiang"  # raw string (or forward slashes) so the backslash isn't treated as an escape
# os.listdir returns bare names, so join each one with the base path
rdd = sc.parallelize([os.path.join(base, name) for name in os.listdir(base)])
rdd2 = rdd.keyBy(lambda x: x).map(lambda f: (f[0], pathlib.Path(f[1]).stat().st_mtime))
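st_mtime is a Unix timestamp (seconds since the epoch), so to filter by modification time you can compare against a timestamp directly, or convert it to a datetime if you prefer. A small sketch, with the cutoff date purely as an illustration:
from datetime import datetime

cutoff = datetime(2020, 12, 16, 13, 40).timestamp()
recent = rdd2.filter(lambda kv: kv[1] >= cutoff)  # keep files modified after the cutoff
Note that os.listdir only looks at the top level of C:\Xiang; if you also need the files inside fold1, fold2, ..., you could build the list of paths with os.walk (or pathlib.Path(base).rglob("*.txt")) before calling parallelize.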
Upvotes: 1