Gaurav Bansal

Reputation: 5660

Get row indices based on condition in Spark

Let's say I have a Spark DataFrame as shown below. How can I get the row indices where value is 0?

ID  | value
-------------
001 | 1
002 | 0
003 | 2
004 | 0
005 | 1

The row indices I want are 2 and 4.

Upvotes: 3

Views: 10843

Answers (4)

Robin

Reputation: 695

// Assuming ss is an active SparkSession. The implicits import is
// needed for .toDF and the $"col" column syntax.
import ss.implicits._

val index = ss.sparkContext
    .parallelize(Seq((1, 1), (2, 0), (3, 2), (4, 0), (5, 1)))
    .toDF("ID", "value")

// Shows the IDs of the rows where value is 0 (2 and 4 for this data).
index.where($"value" === 0).select("ID").show()

Upvotes: -1

Ramesh Maharjan

Reputation: 41957

You can use filter and select to get the indices you want.

Given a dataframe as

+---+-----+
|ID |value|
+---+-----+
|001|1    |
|002|0    |
|003|2    |
|004|0    |
|005|1    |
+---+-----+

you can do the following

df.filter(df.value == 0).select(df.ID).show()

which should give you

+---+
|ID |
+---+
|002|
|004|
+---+

you can use .rdd.flatMap(lambda x: x).collect() to convert the selected single-column DataFrame into a Python list (a DataFrame itself has no flatMap in PySpark 2.x, so you have to go through .rdd first).
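A minimal sketch of that, assuming df is the DataFrame above with ID stored as a string:

# Filter, keep the single ID column, then flatten each Row into its value.
ids = df.filter(df.value == 0).select(df.ID).rdd.flatMap(lambda x: x).collect()
print(ids)  # ['002', '004']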

I hope the answer is helpful

Upvotes: 5

aku

Reputation: 465

There are ways you can do this; I'm thinking rdd.zipWithIndex() and filtering, as sketched below. But why do you need to do this? Relying on row indices is generally discouraged in Spark. What is the eventual goal you are trying to accomplish with these indices? There might be a better way of doing it.
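A minimal PySpark sketch of that approach, assuming the current order of df is the one you care about (zipWithIndex numbers rows in partition order, starting at 0):

# Pair every Row with its positional index, keep rows where value is 0,
# and return the 1-based positions.
indices = (df.rdd.zipWithIndex()
             .filter(lambda pair: pair[0]["value"] == 0)
             .map(lambda pair: pair[1] + 1)
             .collect())
print(indices)  # [2, 4] for the sample data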

Upvotes: 0

user8480446

Reputation: 9

There is no such thing as a row index in a Spark DataFrame. Like a SQL table, a DataFrame is unordered unless it is sorted explicitly.

There is a row_number window function, but it is not intended for global orderings: without a partition specification, the whole dataset is pulled into a single partition.
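A minimal sketch of what that looks like anyway, assuming you are willing to define the order by ID:

from pyspark.sql import Window
from pyspark.sql import functions as F

# An un-partitioned window forces every row into one partition,
# which is why this does not scale to large data.
w = Window.orderBy("ID")
df.withColumn("idx", F.row_number().over(w)) \
  .filter(F.col("value") == 0) \
  .select("idx") \
  .show()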

Overall, if you find yourself depending on row order, you are probably approaching Spark from the wrong direction.

Upvotes: 0
