Reputation: 343
I am new to Spark and I am trying to create a row number for a large dataset. I tried the row_number window function, which works, but it is not performance-efficient: without a partitionBy clause, Spark moves all rows into a single partition to order them.
E.g.:
val df = Seq(
  ("041", false),
  ("042", false),
  ("043", false)
).toDF("id", "flag")
The result should be:
val df = Seq(
  ("041", false, 1),
  ("042", false, 2),
  ("043", false, 3)
).toDF("id", "flag", "rownum")
Currently I am using:
df.withColumn("rownum", row_number().over(Window.orderBy($"id")))
Is there any other way to achieve this result without using window functions? I also tried monotonically_increasing_id and zipWithIndex.
Upvotes: 1
Views: 4728
Reputation: 3770
You can use monotonically_increasing_id (the camel-case monotonicallyIncreasingId is deprecated since Spark 2.0) to get a rownum column:
val df2 = df.withColumn("rownum", monotonically_increasing_id)
Here the index starts at 0. To start the index at 1, add 1:
val df2 = df.withColumn("rownum", monotonically_increasing_id + 1)
Note that the generated IDs are guaranteed to be monotonically increasing and unique, but not consecutive: you only get 0, 1, 2, ... when the DataFrame has a single partition, as in this small example. On a multi-partition DataFrame the values will jump between partitions.
scala> val df2 = df.withColumn("rownum",monotonicallyIncreasingId)
df2: org.apache.spark.sql.DataFrame = [id: string, flag: boolean, rownum: bigint]
scala> df2.show
+---+-----+------+
| id| flag|rownum|
+---+-----+------+
|041|false| 0|
|042|false| 1|
|043|false| 2|
+---+-----+------+
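If you do need strictly consecutive row numbers without a window function, one alternative (a sketch, not a drop-in tested snippet) is the RDD zipWithIndex approach the question mentions. It assigns consecutive 0-based indices across all partitions at the cost of an extra Spark job, and you rebuild the DataFrame by extending the schema. This assumes a SparkSession named `spark` is in scope, as in spark-shell:

```scala
// Sketch: consecutive row numbers via RDD zipWithIndex.
// Assumes `spark` (SparkSession) and `df` from the question are in scope.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}

val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) =>
  // Append the index; +1 so numbering starts at 1 instead of 0.
  Row.fromSeq(row.toSeq :+ (idx + 1))
}

// Extend the original schema with the new rownum column.
val schema = df.schema.add(StructField("rownum", LongType, nullable = false))
val dfWithRownum = spark.createDataFrame(indexedRdd, schema)
```

Unlike monotonically_increasing_id, zipWithIndex guarantees gap-free numbering, but it triggers an extra job to count the rows in each partition, so benchmark it against the window-function version on your actual data.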
Upvotes: 1