Add index column to Apache Spark Dataset<Row> using Java

Question

The question linked below has solutions for Scala and PySpark, but those solutions do not produce consecutive index values.

Spark Dataframe :How to add a index Column : Aka Distributed Data Index

I have an existing Dataset in Apache Spark and I want to select some rows from it based on their index. I am planning to add an index column that contains unique values starting from 1, and then fetch rows based on the values of that column. I found the method below, which adds an index but relies on an order by:

df.withColumn("index", functions.row_number().over(Window.orderBy("a column")));

I do not want to use order by. I need the index to follow the order in which the rows are present in the Dataset. Any help?

Oli · Accepted Answer

From what I gather, you are trying to add an index (with consecutive values) to a dataframe. Unfortunately, there is no built-in function that does that in Spark. You can only add an increasing index (but not necessarily with consecutive values) with df.withColumn("index", functions.monotonically_increasing_id()).
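
To make "increasing but not consecutive" concrete, here is a small plain-Java sketch (no Spark required) that mimics the ID layout described in the Spark documentation for monotonically_increasing_id: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The class and method names are made up for illustration, and the layout is documented as the current implementation rather than a guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the documented ID layout, it does not call Spark.
public final class MonotonicIdSketch {

    // Partition ID in the upper 31 bits, record number in the lower 33 bits.
    public static long idFor(int partitionId, long recordNumber) {
        return ((long) partitionId << 33) + recordNumber;
    }

    // IDs that rows would receive, given the size of each partition in order.
    public static List<Long> idsFor(int... partitionSizes) {
        List<Long> ids = new ArrayList<>();
        for (int p = 0; p < partitionSizes.length; p++) {
            for (long n = 0; n < partitionSizes[p]; n++) {
                ids.add(idFor(p, n));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Two partitions with three rows each.
        // prints [0, 1, 2, 8589934592, 8589934593, 8589934594]
        System.out.println(idsFor(3, 3));
    }
}
```

The jump from 2 to 8589934592 (1L << 33) at the partition boundary is why this column cannot be used as a 1, 2, 3, ... row index once the data spans more than one partition.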

Nonetheless, there exists a zipWithIndex function in the RDD API that does exactly what you need. We can thus define a function that transforms the dataframe into an RDD, adds the index, and transforms it back into a dataframe.

I'm not an expert in Spark with Java (Scala is much more compact), so it might be possible to do better. Here is how I would do it.

public static Dataset<Row> zipWithIndex(Dataset<Row> df, String name) {
    JavaRDD<Row> rdd = df.javaRDD().zipWithIndex().map(t -> {
        Row r = t._1;
        Long index = t._2 + 1;
        ArrayList
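
The snippet above is cut off in the source after ArrayList. What follows is a minimal sketch of how such a helper could be completed, based on the approach described in the answer (Dataset to RDD, zipWithIndex, back to Dataset); it is not necessarily the answer's original code. The class name DatasetIndexer, the use of an Object[] instead of an ArrayList, and the choice to mark the new column as non-nullable are assumptions made for illustration. It needs the Spark SQL library on the classpath.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper class; the name is not from the original answer.
public final class DatasetIndexer {

    // Returns df with an extra long column `name` holding 1, 2, 3, ...
    // in the order in which the rows appear in the underlying RDD.
    public static Dataset<Row> zipWithIndex(Dataset<Row> df, String name) {
        JavaRDD<Row> rdd = df.javaRDD().zipWithIndex().map(t -> {
            Row r = t._1();
            // Copy the existing values and append the index as the last field.
            Object[] values = new Object[r.size() + 1];
            for (int i = 0; i < r.size(); i++) {
                values[i] = r.get(i);
            }
            values[r.size()] = t._2() + 1; // zipWithIndex is 0-based; shift to start at 1
            return RowFactory.create(values);
        });

        // Original schema plus the new index column (non-nullable is an assumption).
        StructType schema = df.schema().add(name, DataTypes.LongType, false);
        return df.sparkSession().createDataFrame(rdd, schema);
    }
}
```

With this in place, something like DatasetIndexer.zipWithIndex(df, "index").filter("index <= 10") would select the first ten rows. Note that, per the RDD documentation, zipWithIndex orders rows by partition index and then by position within each partition, and it has to trigger an extra Spark job when the RDD has more than one partition.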
