Reputation: 229
So I have a dataset of words, I try to keep only those that are longer than 6 characters:
data=dataset.map(lambda word: word,len(word)).filter(len(word)>=6)
When:
print data.take(10)
it returns all of the words, including the first 3, which have length lower than 6. I dont actually want to print them, but to continue working on the data that have length greater than 6.
So when I will have the appropriate dataset, I would like to be able to select the data that I need, for example the ones that have length less than 15 and be able to make computations on them.
Or even to apply a function on the "word".
Any ideas??
Upvotes: 0
Views: 2656
Reputation: 49410
What you want is something along this (untested):
data=dataset.map(lambda word: (word,len(word))).filter(lambda t : t[1] >=6)
In the map
, you return a tuple of (word, length of word)
and the filter
will look at the length of word (the l
) to take only the (w,l)
whose l
is greater or equal to 6
Upvotes: 1