Wael Amri

Reputation: 1

Find first non-zero element in PySpark DataFrame

I am working with a PySpark DataFrame and am looking for a way to get the index of the first non-zero element. I added the index column myself, since PySpark, unlike pandas, does not provide one.

Upvotes: 0

Views: 1240

Answers (1)

Steven

Reputation: 15258

Let's assume your dataframe looks like this:

df.show()
+---+-----+
|idx|value|
+---+-----+
|  0|    0|
|  1|    0|
|  2|    1|  # <-- We want this one
|  3|    2|
|  4|    3|
|  5|    4|
+---+-----+

You can achieve this easily by filtering out the zeros and taking the min of idx:

from pyspark.sql import functions as F

df.where(F.col("value") != 0).select(F.min("idx")).show()

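Or with a row_number (note that a window with no partitionBy moves all rows into a single partition, so the min version scales better on large data):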

from pyspark.sql import functions as F, Window

df.where(F.col("value") != 0).withColumn(
    "rwnb", F.row_number().over(Window.orderBy("idx"))
).where(F.col("rwnb") == 1).select("idx").show()
+---+
|idx|
+---+
|  2|
+---+

Upvotes: 1
