RichaDwivedi

Reputation: 343

How to achieve Rownum feature in Spark Dataframe similar to oracle rownum

I am new to Spark and I am trying to create a row number for a large dataset. I tried the row_number window function, which works fine, but it is not efficient because I am not using a partitionBy clause, so all rows are shuffled into a single partition.

Eg:

 val df= Seq(
        ("041", false),
        ("042", false),
        ("043", false)
      ).toDF("id", "flag")

Result should be :

val df= Seq(
        ("041", false,1),
        ("042", false,2),
        ("043", false,3)
      ).toDF("id", "flag","rownum")

currently I am using

df.withColumn("rownum",row_number().over(Window.orderBy($"id")))

Is there any other way to achieve this result without using window functions? I also tried monotonicallyIncreasingId and zipWithIndex.
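For reference, my zipWithIndex attempt looked roughly like this (a sketch, assuming a spark-shell session with a SparkSession named spark and its implicits in scope; rebuilding the DataFrame from the indexed RDD is one common pattern, not the only one):

```scala
// Sketch: RDD.zipWithIndex assigns consecutive 0-based indices per row
// without forcing everything through one partition the way an
// un-partitioned window sort does.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df = Seq(("041", false), ("042", false), ("043", false)).toDF("id", "flag")

// Zip each row with its index, shift to 1-based, and rebuild a DataFrame.
val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx + 1))
}
val schema = StructType(df.schema.fields :+ StructField("rownum", LongType, nullable = false))
val withRownum = spark.createDataFrame(indexed, schema)
```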

Upvotes: 1

Views: 4728

Answers (1)

Rajat Mishra

Reputation: 3770

You can use monotonicallyIncreasingId to get a rownum-like column:

val df2 = df.withColumn("rownum",monotonicallyIncreasingId)

Here the index starts at 0.

To start the index at 1, add 1 to monotonicallyIncreasingId:

val df2 = df.withColumn("rownum",monotonicallyIncreasingId+1)

scala> val df2 = df.withColumn("rownum",monotonicallyIncreasingId)
df2: org.apache.spark.sql.DataFrame = [id: string, flag: boolean, rownum: bigint]

scala> df2.show
+---+-----+------+
| id| flag|rownum|
+---+-----+------+
|041|false|     0|
|042|false|     1|
|043|false|     2|
+---+-----+------+
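One caveat worth noting (this concerns multi-partition data, which the single-partition session above does not show): monotonicallyIncreasingId guarantees unique, monotonically increasing values, but they are only consecutive within a partition, because the partition ID is encoded in the upper bits of the 64-bit value. A sketch of what can happen, assuming spark-shell with implicits in scope:

```scala
// Sketch: with more than one partition the generated IDs are unique and
// increasing, but NOT consecutive -- each partition's IDs start at
// partitionId * 2^33.
import org.apache.spark.sql.functions.monotonically_increasing_id

val df3 = Seq(("041", false), ("042", false), ("043", false))
  .toDF("id", "flag")
  .repartition(3)
  .withColumn("rownum", monotonically_increasing_id)
// rownum values may come out as e.g. 0, 8589934592, 17179869184
// rather than 0, 1, 2.
```

So this approach is fine for a unique identifier, but if you need Oracle-style consecutive row numbers you still need row_number or a zipWithIndex-style rebuild.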



Upvotes: 1
