Reputation: 494
I have a DataFrame containing roughly 20k rows, and I want to delete a block of 186 rows starting at a random position in the dataset.
For context: I am testing a classification model on missing data. Each row has a Unix timestamp, and there are 62 rows of data per second, so 186 rows corresponds to 3 seconds.
My reasoning is that when data is streaming, it is likely to go missing for several seconds at a time. Since I am extracting features from a time window, I want to see how missing data affects model performance.
I think the best approach would be to convert to an RDD and use the filter function, putting the logic inside the filter, something like this:
dataFrame.rdd.zipWithIndex().filter(lambda x: ...)
But I am stuck on the logic: how do I implement this in PySpark?
Upvotes: 1
Views: 3035
Reputation: 7742
Try something like this:
import random

# pick a random start index so a full 186-row window fits in the DataFrame
startVal = random.randint(0, dataFrame.count() - 186)

dataFrame.rdd.zipWithIndex()\
    .filter(lambda x: not (startVal <= x[1] < startVal + 186))\
    .map(lambda x: x[0])
zipWithIndex pairs each row with its index, so x[1] is the index and x[0] is the original row; the final map drops the index again and leaves you with an RDD of rows.
This should work!
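You can sanity-check the window-drop arithmetic without Spark by running the same logic over a plain Python list (a stand-in for the DataFrame rows, using the 62 rows/second figure from the question):

```python
import random

ROWS_PER_SECOND = 62
N_DROP = 3 * ROWS_PER_SECOND        # 186 rows = 3 seconds of data

rows = list(range(20000))           # stand-in for the ~20k DataFrame rows

# choose a start index so the full 186-row window fits
start = random.randint(0, len(rows) - N_DROP)

# keep every row whose index falls outside [start, start + N_DROP)
kept = [row for i, row in enumerate(rows)
        if not (start <= i < start + N_DROP)]

print(len(rows) - len(kept))        # prints 186: exactly one 3-second gap removed
```

The filter in the Spark version applies exactly this index test, just with x[1] in place of i.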
Upvotes: 3