Koksi

Reputation: 51

Spark: Change the Key of key value pair

Is it possible to change the key of a key-value pair? I load files from different folders, and the key is currently just the path to the file, but I want to change the key to an integer depending on which folder the file comes from.

dir_pair_data = sc.wholeTextFiles(mypath)
dir_pair_data = dir_pair_data.map(lambda (x,y) : os.path.dirname(x),y )

Of course this doesn't work... does anyone have a hint for me? I'm pretty new to Spark and Python...

Upvotes: 2

Views: 6718

Answers (1)

Rohan Aletty

Reputation: 2442

I believe the following piece of code accomplishes what you want in terms of keying each set of files by a unique ID corresponding to its parent directory (though admittedly it could be optimized, since I'm a little new to PySpark myself):

import os

dir_pair_data = sc.wholeTextFiles(mypath)
dir_pair_data = (dir_pair_data
                 .map(lambda kv: (os.path.dirname(kv[0]), kv[1]))
                 .groupByKey()
                 .values()
                 .zipWithUniqueId()
                 .map(lambda x: (x[1], x[0]))
                 .flatMapValues(lambda x: x))

As a summary of the steps:

  1. map -- places the key-value pairs into tuples, converting the key into the parent directory
  2. groupByKey -- groups all text files by the corresponding parent directory
  3. values -- sheds the parent directory element and returns only the grouped text files
  4. zipWithUniqueId -- provides a unique Long identifier to each grouped set of text files
  5. map -- swaps elements so the key is the Long id
  6. flatMapValues -- flattens the grouped text files so that each file is contained within its own record
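Without a Spark cluster at hand, the same sequence of steps can be sketched in plain Python on a handful of made-up `(path, contents)` pairs (the paths and contents below are placeholders, and `enumerate` stands in for `zipWithUniqueId`; note that Spark's `zipWithUniqueId` guarantees uniqueness but not consecutive ids, so use `zipWithIndex` if you need 0, 1, 2, ...):

```python
import os
from itertools import groupby

# Simulated output of sc.wholeTextFiles: (full_path, file_contents) pairs
pairs = [
    ("/data/a/f1.txt", "alpha"),
    ("/data/a/f2.txt", "beta"),
    ("/data/b/f3.txt", "gamma"),
]

# Step 1: re-key each pair by the file's parent directory
keyed = [(os.path.dirname(p), contents) for p, contents in pairs]

# Steps 2-3: group contents by directory, keep only the grouped values
keyed.sort(key=lambda kv: kv[0])
grouped = [[c for _, c in grp] for _, grp in groupby(keyed, key=lambda kv: kv[0])]

# Steps 4-5: attach an integer id to each group and make it the key
with_ids = [(i, files) for i, files in enumerate(grouped)]

# Step 6: flatten so each file gets its own (id, contents) record
result = [(i, c) for i, files in with_ids for c in files]
print(result)  # [(0, 'alpha'), (0, 'beta'), (1, 'gamma')]
```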

Upvotes: 2
