Reputation: 2659
Is it possible to delete the nth row from a DataFrame without using collect and then converting back to a DataFrame? I want to avoid collect because I have a large dataset.
val arr=df.collect().toBuffer
arr.remove(13)
Maybe I could somehow convert that back to a DataFrame, but is there an easier way? I tried zipWithIndex, but DataFrame doesn't support zipWithIndex.
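For example, something along these lines doesn't compile:

val indexed = df.zipWithIndex()   // fails: zipWithIndex is defined on RDD, not on DataFrame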
value zipWithIndex is not a member of org.apache.spark.sql.DataFrame
Upvotes: 1
Views: 6583
Reputation: 2155
In Spark terms I would say transforming an RDD is better than converting it. Here is one example that uses the filter method to do this efficiently. You would definitely need an index column for this example.
import org.apache.spark.sql._
val list = Seq(("one", 1), ("two", 2), ("three", 3),("four", 4),("five", 5))
val sqlContext = new SQLContext(sc)
val numdf = sqlContext.createDataFrame(list)
numdf.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: integer (nullable = false)
val newdf = numdf.filter(numdf("_2") < 2 or numdf("_2") > 2)
newdf.show()
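If the DataFrame doesn't already have a usable index column, one way to add one (a rough sketch building on the example above, not from the original notebook) is to zip the underlying RDD with an index and rebuild the DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch: append a positional "rowIndex" column so any row can be dropped with filter.
val indexedRdd = numdf.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val indexedSchema = StructType(numdf.schema.fields :+ StructField("rowIndex", LongType, nullable = false))
val indexeddf = sqlContext.createDataFrame(indexedRdd, indexedSchema)
indexeddf.filter(indexeddf("rowIndex") !== 2).show()   // drops the third row, ("three", 3)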
Here is my #bluemix notebook.
Thanks,
Charles.
Upvotes: 0
Reputation: 37832
DataFrame doesn't support this as far as I know; you'll need to use the RDD API. You can convert back to a DataFrame right after. Note that this is very different from using collect, which copies all the data to your driver.
// zipWithIndex pairs every Row with its position; the partial function keeps all rows except the one at index 13
val filteredRdd = input.rdd.zipWithIndex().collect { case (r, i) if i != 13 => r }
// rebuild a DataFrame from the filtered rows, reusing the original schema
val newDf = sqlContext.createDataFrame(filteredRdd, input.schema)
(The collect used here isn't the one that collects data to the driver; it applies a partial function to do the filtering and the mapping in one call.)
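As a rough end-to-end usage sketch (the sample data and the dropped index are just illustrative, and sqlContext is assumed to be in scope):

val input = sqlContext.createDataFrame(Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)))
// keep every row except the one at index 1
val filteredRdd = input.rdd.zipWithIndex().collect { case (r, i) if i != 1 => r }
val newDf = sqlContext.createDataFrame(filteredRdd, input.schema)
newDf.show()   // ("two", 2) is gone; newDf.count() is 3

Note that zipWithIndex assigns indices in partition order, so "the nth row" is only meaningful if the DataFrame has a stable ordering to begin with.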
Disclaimer: Please remember that DataFrames in Spark are like RDDs in the sense that they're an immutable data structure. Therefore, things like creating a new column, removing a row, or accessing a single element of a DataFrame by index can't exist, simply because that kind of operation goes against the principles of Spark. Don't forget that you're using a distributed data structure, not an in-memory random-access data structure.
To be clear, this doesn't mean that you can't do the same kind of thing (i.e. create a new column) using Spark; it means that you have to think immutable/distributed and rewrite parts of your code, mostly the parts that are not purely thought of as transformations on a stream of data.
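For instance (an illustrative sketch, reusing the sample input from the usage example above), "adding a column" or "removing rows" is written as a transformation that returns a new DataFrame and leaves the original untouched:

// each call returns a new DataFrame; `input` itself is never modified
val withFlag = input.withColumn("isEven", input("_2") % 2 === 0)
val withoutTwos = input.filter(input("_2") !== 2)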
Upvotes: 3