Reputation: 51
Is it possible to change the key of a key-value pair? I load files from different folders, and the key is currently just the path to the file, but I want to change the key to an Integer depending on which folder the file comes from.
dir_pair_data = sc.wholeTextFiles(mypath)
dir_pair_data = dir_pair_data.map(lambda (x,y) : os.path.dirname(x),y )
Of course this doesn't work... does anyone have a hint for me? I'm pretty new to Spark and Python...
Upvotes: 2
Views: 6718
Reputation: 2442
I believe the following piece of code accomplishes what you want in terms of keying each set of files by a unique ID corresponding to its parent directory (though admittedly, it could be optimized since I'm a little new to pyspark myself):
import os

dir_pair_data = sc.wholeTextFiles(mypath)
dir_pair_data = (dir_pair_data
                 # key each file's contents by its parent directory
                 .map(lambda kv: (os.path.dirname(kv[0]), kv[1]))
                 # group all file contents that share a parent directory
                 .groupByKey()
                 # keep only the grouped contents, dropping the directory keys
                 .values()
                 # attach a unique Long id to each group
                 .zipWithUniqueId()
                 # swap the pair so the id becomes the key
                 .map(lambda pair: (pair[1], pair[0]))
                 # expand each group into one (id, contents) pair per file
                 .flatMapValues(lambda contents: contents))
As a summary of the steps:
map - replaces each full file path key with the file's parent directory
groupByKey - groups the file contents by parent directory
values - keeps only the grouped file contents, dropping the directory keys
zipWithUniqueId - assigns a unique Long identifier to each grouped set of text files
map - swaps the pair so the Long id becomes the key
flatMapValues - flattens each group so every file's contents is paired with its directory's id
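If it helps to see the shape of the result, here is a small, hypothetical inspection step (the take(5) call and the truncation to 40 characters are only illustrative assumptions, not part of the original answer):

# Each file from the same folder now shares the same integer (Long) key.
for folder_id, contents in dir_pair_data.take(5):
    print(folder_id, contents[:40])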
Upvotes: 2