other15

Reputation: 849

Spark randomly drop rows

I'm testing a classifier on missing data and want to randomly delete rows in Spark.

I want to do something like this: at every nth row, delete 20 rows.

What would be the best way to do this?

Upvotes: 2

Views: 1776

Answers (1)

Alberto Bonsanto

Reputation: 18022

If the deletion should be random, you can use sample, which returns a random fraction of a DataFrame. However, if your goal is to split the data into training and validation sets, you can use randomSplit.
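A minimal sketch of both calls, assuming df is an existing PySpark DataFrame. The helper names drop_random_rows and split_train_validation, the default weights, and the seed parameter are hypothetical and only for illustration. Note that sample keeps each row independently with the given probability, so the number of rows removed is approximate rather than exact.

```python
def drop_random_rows(df, drop_fraction, seed=None):
    """Return a DataFrame with roughly `drop_fraction` of the rows removed.

    `df` is assumed to be a pyspark.sql.DataFrame. sample() takes the
    fraction of rows to KEEP, so the drop fraction is inverted here.
    """
    return df.sample(withReplacement=False,
                     fraction=1.0 - drop_fraction,
                     seed=seed)


def split_train_validation(df, train_weight=0.8, seed=None):
    """Split `df` into (training, validation) DataFrames with randomSplit.

    The weights are relative, so 0.8 and 0.2 give roughly an 80/20 split.
    """
    train, validation = df.randomSplit(
        [train_weight, 1.0 - train_weight], seed=seed)
    return train, validation
```

For example, drop_random_rows(df, 0.2, seed=42) would keep about 80% of the rows, and passing the same seed again should reproduce the same selection on the same data and partitioning.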

Another, less elegant option is to convert your DataFrame into an RDD, use zipWithIndex, and filter by index, maybe something like:

df.rdd.zipWithIndex().filter(lambda x: x[-1] % 20 != 0).map(lambda x: x[0])
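The index-based idea can be adapted to drop a run of rows at regular intervals. The sketch below is one possible reading of the question, not a definitive one: it assumes that "for every nth row, delete 20 rows" means dropping the first k rows of each block of n consecutive rows. The names keep_row and drop_blocks and the parameters n and k are hypothetical. The indices come from zipWithIndex, so they follow the RDD's partition order, which is not necessarily a meaningful row order.

```python
def keep_row(index, n, k=20):
    """Return True if the row at this 0-based index should be kept.

    Assumption: within every block of n consecutive rows, the first k
    rows are dropped and the remaining n - k rows are kept.
    """
    return index % n >= k


def drop_blocks(df, n, k=20):
    """Drop k rows out of every n rows of a PySpark DataFrame `df`."""
    indexed = df.rdd.zipWithIndex()  # RDD of (Row, index) pairs
    kept = (indexed
            .filter(lambda pair: keep_row(pair[1], n, k))
            .map(lambda pair: pair[0]))  # discard the index again
    # Rebuild a DataFrame with the original schema.
    return kept.toDF(df.schema)
```

With n=100 and k=20, for instance, rows with indices 0 to 19, 100 to 119, and so on would be removed. Keeping the predicate in a plain function makes the rule easy to check without a Spark cluster.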

Upvotes: 3
