PySpark: length of an element and how to use it later

Question

I have a dataset of words, and I am trying to keep only those that are at least 6 characters long:

data=dataset.map(lambda word: word,len(word)).filter(len(word)>=6)

When I run:

print data.take(10)

it returns all of the words, including the first 3, which are shorter than 6 characters. I don't actually want to print them; I want to continue working on the words whose length is at least 6.

Once I have the filtered dataset, I would like to be able to select the data I need from it, for example the words shorter than 15 characters, and run computations on them.

I would also like to be able to apply a function to the "word" itself.

Any ideas?

ccheneson · Accepted Answer

What you want is something along these lines (untested):

data=dataset.map(lambda word: (word,len(word))).filter(lambda t : t[1] >=6)

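In the map, you return a tuple (w, l) of the word and its length. The filter then looks at the length (l, which is t[1]) and keeps only the tuples whose l is greater than or equal to 6. Note the parentheses around the tuple in the map: without them, the lambda body is just word, and len(word) is parsed as a separate argument to map. Likewise, filter needs a function, not the expression len(word)>=6.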
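To address the follow-up parts of the question (selecting words shorter than 15 characters and applying a function to the word), here is a minimal sketch of the same map/filter logic. It runs on a plain Python list so it can be tried without a Spark cluster. The sample words and the helper names with_length, at_least_6 and shorter_than_15 are made up for illustration, and dataset in the trailing comments is assumed to be an RDD of strings as in the question.

```python
# Plain-Python sketch of the map/filter logic; the sample words are made up.
words = ["cat", "house", "elephant", "programming", "internationalization"]


def with_length(word):
    """Pair a word with its length, as the map step does."""
    return (word, len(word))


def at_least_6(pair):
    """Keep (word, length) pairs whose length is 6 or more."""
    return pair[1] >= 6


def shorter_than_15(pair):
    """Keep (word, length) pairs whose length is below 15."""
    return pair[1] < 15


# Step 1: attach lengths, then drop the short words.
pairs = list(filter(at_least_6, map(with_length, words)))

# Step 2: keep working on the filtered data, e.g. select lengths below 15.
medium = list(filter(shorter_than_15, pairs))

# Step 3: apply a function to the word itself (here: upper-casing).
upper = [(w.upper(), n) for (w, n) in medium]

# Assuming `dataset` is an RDD of strings, the same callables would be
# passed to the RDD methods instead of the built-ins:
#   data   = dataset.map(with_length).filter(at_least_6)
#   medium = data.filter(shorter_than_15)
#   upper  = medium.map(lambda t: (t[0].upper(), t[1]))
```

Because each step returns a new collection (or a new RDD in Spark), the filtered result can be stored in a variable and reused for later selections and computations rather than only printed.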
