Reputation: 494
I have a DataFrame containing roughly 20k rows, and I want to delete a block of 186 rows starting at a random position in the dataset.
For context: I am testing a classification model on missing data. Each row has a Unix timestamp, and there are 62 rows of data per second, so 186 rows corresponds to 3 seconds.
My reasoning is that when data is streaming, it is likely to go missing for several seconds at a time. Since I am extracting features from a time window, I want to see how missing data affects model performance.
I think the best approach would be to convert to an RDD and use the filter function, putting the logic inside the filter, something like this:
dataFrame.rdd.zipWithIndex().filter(lambda x: ...)
But I am stuck on the logic: how do I implement this in PySpark?
Upvotes: 1
Views: 3035
Reputation: 7742
Try something like this:
import random

# pick a random start index so a full 186-row window fits in the DataFrame
startVal = random.randint(0, dataFrame.count() - 186)

dataFrame.rdd.zipWithIndex()\
    .filter(lambda x: not (startVal <= x[1] < startVal + 186))\
    .map(lambda x: x[0])
zipWithIndex pairs each row with its index, so x[1] is the index and x[0] is the original row; the final map drops the index again and leaves you with an RDD of rows.
This should work!
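You can sanity-check the window-drop arithmetic without Spark by running the same logic over a plain Python list (a stand-in for the DataFrame rows, using the 62 rows/second figure from the question):

```python
import random

ROWS_PER_SECOND = 62
N_DROP = 3 * ROWS_PER_SECOND        # 186 rows = 3 seconds of data

rows = list(range(20000))           # stand-in for the ~20k DataFrame rows

# choose a start index so the full 186-row window fits
start = random.randint(0, len(rows) - N_DROP)

# keep every row whose index falls outside [start, start + N_DROP)
kept = [row for i, row in enumerate(rows)
        if not (start <= i < start + N_DROP)]

print(len(rows) - len(kept))        # prints 186: exactly one 3-second gap removed
```

The filter in the Spark version applies exactly this index test, just with x[1] in place of i.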
Upvotes: 3