Reputation: 849
I'm testing a classifier on missing data and want to randomly delete rows in Spark.
I want to do something like: for every nth row, delete 20 rows.
What would be the best way to do this?
Upvotes: 2
Views: 1776
Reputation: 18022
If the deletion is random, you can use `sample`; this method lets you keep a fraction of a DataFrame. However, if your goal is to split your data into training and validation sets, you can use `randomSplit`.
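To make the behavior of `sample` concrete without a Spark cluster, here is a plain-Python sketch of its semantics: each row is kept independently with the given probability, so the result size is only approximately `fraction * n`. The list of integers and the seed are illustrative assumptions, not from the question.

```python
import random

# Stand-in for a DataFrame: a list of 1000 records.
rows = list(range(1000))

# Sketch of df.sample(fraction=0.8, seed=42): keep each row
# independently with probability `fraction`.
random.seed(42)
fraction = 0.8
sampled = [r for r in rows if random.random() < fraction]

# The kept count is close to, but not exactly, fraction * len(rows).
print(len(sampled))
```

In Spark the call would look like `df.sample(fraction=0.8, seed=42)`, and `df.randomSplit([0.8, 0.2], seed=42)` returns the complementary training/validation pair in one pass.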
Another, less elegant option is to convert your DataFrame to an RDD, use `zipWithIndex`, and filter by index, maybe something like:

df.rdd.zipWithIndex().filter(lambda x: x[1] % 20 != 0).map(lambda x: x[0])

Note that `zipWithIndex` returns (row, index) pairs, so the final `map` strips the index and leaves you with the surviving rows.
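The modulo-index logic can be checked with plain Python lists standing in for the RDD (the `rows` data below is an illustrative assumption):

```python
# Stand-in for an RDD: 100 records paired with their index,
# mimicking rdd.zipWithIndex().
rows = [f"row{i}" for i in range(100)]
indexed = list(enumerate(rows))  # (index, row) pairs

# Keep rows whose index is NOT a multiple of 20, as in the
# filter(lambda x: x[1] % 20 != 0) step, then drop the index.
kept = [row for idx, row in indexed if idx % 20 != 0]

# Indices 0, 20, 40, 60, 80 are removed: 95 rows remain.
print(len(kept))
```

This removes exactly one row in every 20 (the ones at indices divisible by 20); adjust the modulus and comparison if you want to drop a different pattern of rows.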
Upvotes: 3