Reputation: 5660
Let's say I have a Spark DataFrame as shown below. How can I get the row indices where value
is 0?
ID | value
-------------
001 | 1
002 | 0
003 | 2
004 | 0
005 | 1
The row indices I want are 2 and 4.
Upvotes: 3
Views: 10843
Reputation: 695
import ss.implicits._  // needed for .toDF and the $ column syntax (ss is the SparkSession)

// Rebuild the example data, then keep the IDs whose value is 0
val index = ss.sparkContext
  .parallelize(Seq((1, 1), (2, 0), (3, 2), (4, 0), (5, 1)))
  .toDF("ID", "value")
index.where($"value" === 0).select("ID").show()
Upvotes: -1
Reputation: 41957
You can use filter and select to get the indexes you want.
Given a dataframe as
+---+-----+
|ID |value|
+---+-----+
|001|1 |
|002|0 |
|003|2 |
|004|0 |
|005|1 |
+---+-----+
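(For reference, a minimal sketch of how that dataframe could be built in PySpark, assuming an existing SparkSession; the name spark is an assumption, not from the question:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# IDs are kept as strings so the leading zeros survive
df = spark.createDataFrame(
    [("001", 1), ("002", 0), ("003", 2), ("004", 0), ("005", 1)],
    ["ID", "value"],
)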
you can do the following
df.filter(df.value == 0).select(df.ID)
which should give you
+---+
|ID |
+---+
|002|
|004|
+---+
You can use .rdd.flatMap(lambda x: x).collect() to convert the selected single-column dataframe to a list (on Spark 2.x and later you have to go through .rdd, since the Python DataFrame no longer exposes flatMap directly).
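For example, a small sketch assuming the dataframe above is named df:

# Collect the matching IDs into a plain Python list
ids = df.filter(df.value == 0).select(df.ID).rdd.flatMap(lambda x: x).collect()
# ids is now ['002', '004']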
I hope the answer is helpful
Upvotes: 5
Reputation: 465
There are ways to do this, for example rdd.zipWithIndex() followed by a filter. But why do you need to do this? Relying on row indices is generally discouraged in Spark. What is the eventual goal you are trying to accomplish with these indices? There might be a better way of doing it.
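If you really do need positions, a rough PySpark sketch using zipWithIndex (assuming the dataframe is named df and that its current order is the order you care about, which only holds if nothing has shuffled it):

# Pair each row with its 0-based position in the current ordering
indexed = df.rdd.zipWithIndex()
# Keep rows whose value is 0 and report 1-based positions like those in the question
positions = indexed.filter(lambda pair: pair[0].value == 0).map(lambda pair: pair[1] + 1).collect()
# positions would be [2, 4] for the example data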
Upvotes: 0
Reputation: 9
There is no such thing as an index in a Spark DataFrame. Like a SQL table, a DataFrame is unordered unless it is sorted explicitly.
There is a row_number window function, but it is not intended for global orderings.
Overall, if you are thinking about order, you are probably approaching Spark from the wrong direction.
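For completeness, a sketch of row_number in PySpark (assuming the dataframe is named df and that ordering by ID is acceptable; a global window like this pulls all rows into a single partition, so it does not scale well):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Assign a 1-based position ordered by ID, then keep the positions where value is 0
w = Window.orderBy("ID")
df.withColumn("idx", row_number().over(w)).filter(col("value") == 0).select("idx").show()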
Upvotes: 0