Reputation: 2659
Is it possible to delete the nth row from a DataFrame without using collect and then converting back to a DataFrame? I want to avoid collect because I have a large dataset.
val arr=df.collect().toBuffer
arr.remove(13)
Maybe I could somehow convert that back to a DataFrame, but is there an easier way? I tried zipWithIndex, but DataFrame doesn't support zipWithIndex.
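For example, something along these lines doesn't compile:

val indexed = df.zipWithIndex()   // fails: zipWithIndex is defined on RDD, not on DataFrame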
value zipWithIndex is not a member of org.apache.spark.sql.DataFrame
Upvotes: 1
Views: 6583
Reputation: 2155
In Spark terms I would say transforming an RDD is better than converting it. Here is one example that uses the filter method to do this efficiently. You would definitely need an index column for this example.
import org.apache.spark.sql._
val list = Seq(("one", 1), ("two", 2), ("three", 3),("four", 4),("five", 5))
val sqlContext = new SQLContext(sc)
val numdf = sqlContext.createDataFrame(list)
numdf.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: integer (nullable = false)
val newdf = numdf.filter(numdf("_2") < 2 or numdf("_2") > 2)
newdf.show()
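If the DataFrame doesn't already have a usable index column, one way to add one (a rough sketch building on the example above, not from the original notebook) is to zip the underlying RDD with an index and rebuild the DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch: append a positional "rowIndex" column so any row can be dropped with filter.
val indexedRdd = numdf.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val indexedSchema = StructType(numdf.schema.fields :+ StructField("rowIndex", LongType, nullable = false))
val indexeddf = sqlContext.createDataFrame(indexedRdd, indexedSchema)
indexeddf.filter(indexeddf("rowIndex") !== 2).show()   // drops the third row, ("three", 3)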
Here is my #bluemix notebook.
Thanks,
Charles.
Upvotes: 0
Reputation: 37832
DataFrame doesn't support this as far as I know; you'll need to use the RDD API. You can convert back to a DataFrame right after. Note that this is very different from using collect, which copies all the data to your driver.
// zipWithIndex pairs every Row with its position; the partial function keeps all rows except the one at index 13
val filteredRdd = input.rdd.zipWithIndex().collect { case (r, i) if i != 13 => r }
// rebuild a DataFrame from the filtered rows, reusing the original schema
val newDf = sqlContext.createDataFrame(filteredRdd, input.schema)
(The collect used here isn't the one that collects data to the driver; it applies a partial function to do the filtering and the mapping in one call.)
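As a rough end-to-end usage sketch (the sample data and the dropped index are just illustrative, and sqlContext is assumed to be in scope):

val input = sqlContext.createDataFrame(Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)))
// keep every row except the one at index 1
val filteredRdd = input.rdd.zipWithIndex().collect { case (r, i) if i != 1 => r }
val newDf = sqlContext.createDataFrame(filteredRdd, input.schema)
newDf.show()   // ("two", 2) is gone; newDf.count() is 3

Note that zipWithIndex assigns indices in partition order, so "the nth row" is only meaningful if the DataFrame has a stable ordering to begin with.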
Disclaimer: Please remember that DataFrames in Spark are like RDDs in the sense that they're an immutable data structure. Therefore, things like creating a new column, removing a row, or accessing a single element of a DataFrame by index can't exist, simply because that kind of operation goes against the principles of Spark. Don't forget that you're using a distributed data structure, not an in-memory random-access data structure.
To be clear, this doesn't mean that you can't do the same kind of thing (i.e. create a new column) using Spark; it means that you have to think immutable/distributed and rewrite parts of your code, mostly the parts that are not purely thought of as transformations on a stream of data.
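For instance (an illustrative sketch, reusing the sample input from the usage example above), "adding a column" or "removing rows" is written as a transformation that returns a new DataFrame and leaves the original untouched:

// each call returns a new DataFrame; `input` itself is never modified
val withFlag = input.withColumn("isEven", input("_2") % 2 === 0)
val withoutTwos = input.filter(input("_2") !== 2)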
Upvotes: 3