Reputation: 3282
I have a dataframe where I want to get the first row where the indicator
column is 0. For example, my dataframe will look like this:
network volume indicator Hour
YYY 20 1 10
YYY 30 0 9
YYY 40 0 8
YYY 80 1 7
TTT 50 0 10
TTT 40 1 8
TTT 10 0 4
TTT 10 1 2
The result should look like this:
network volume indicator Hour
YYY 20 1 10
YYY 30 0 9
YYY 80 1 7
TTT 50 0 10
TTT 40 1 8
TTT 10 1 2
So the ones with ones would still stay, while I get the first time the indicator was 0 for each network. I want to have everything sorted by hour in descending order when I do this, so I get the most recent 0 indicator. How do I go about achieving this result?
Upvotes: 0
Views: 141
Reputation: 454
Here is your required code, with comments inline to help you understand too: (updated the output with your latest dataset, with multiple 1's in the indicator column)
sourceData.show()
+-------+------+---------+----+
|network|volume|indicator|Hour|
+-------+------+---------+----+
| YYY| 20| 1| 10|
| YYY| 30| 0| 9|
| YYY| 40| 0| 8|
| YYY| 80| 1| 7|
| TTT| 50| 0| 10|
| TTT| 40| 1| 8|
| TTT| 10| 0| 4|
| TTT| 10| 1| 2|
+-------+------+---------+----+
sourceData.printSchema()
root
|-- network: string (nullable = true)
|-- volume: integer (nullable = true)
|-- indicator: integer (nullable = true)
|-- Hour: integer (nullable = true)
Required Transformation Code:
//splitting your data set into two parts with indicator 1 and 0
val indicator1Df = sourceData.filter("indicator == 1")
val indicator0Df = sourceData.filter("indicator == 0")
//getting the first row for all indicator=0
indicator0Df.createOrReplaceTempView("indicator0")
val firstIndicator0df = spark.sql("select network, volume, indicator, hour from (select i0.network,i0.volume,i0.indicator,i0.hour,ROW_NUMBER() over (partition by i0.network order by i0.Hour desc) as rnk from indicator0 i0) i where rnk = 1")
//merging both the dataframes back to for your required output result
val finalDf = indicator1Df.union(firstIndicator0df).orderBy($"network".desc,$"Hour".desc)
finalDf.show()
Final Output:
+-------+------+---------+----+
|network|volume|indicator|Hour|
+-------+------+---------+----+
| YYY| 20| 1| 10|
| YYY| 30| 0| 9|
| YYY| 80| 1| 7|
| TTT| 50| 0| 10|
| TTT| 40| 1| 8|
| TTT| 10| 1| 2|
+-------+------+---------+----+
Upvotes: 1