Reputation: 117
I have dataFrame
as below and want to add remarks using Scala
id val visits
111 2 1
111 2 1
112 4 2
112 5 4
113 6 1
Expected output should be below
id val visits remarks
111 2 1 Ramdom
111 2 1 Ramdom
112 4 2 Less visit
112 5 4 More visit
113 6 1 One visit
Remarks should be:
Ramdom for Id has two records with same value & visits
One Visit for Id has only one record which contains any no of visits
Less Visit for Id has two records with less visits compared to other
More Visit for Id has more than one record with different value and visits.
Upvotes: 0
Views: 70
Reputation: 7928
May not be the best solution but it's a working one:
First group your dataFrame
by val
and visits
and the count of them
val grouped = df.groupBy("id").agg(max($"val").as("maxVal"), max($"visits").as("maxVisits"), min($"val").as("minVal"), min($"visits").as("minVisits"), count($"id").as("count"))
Then define a UDF
where you implement your logic:
val remarks = functions.udf ((value: Int, visits: Int, maxValue: Int, maxVisits: Int, minValue: Int, minVisits: Int, count: Int) =>
if (count == 1) {
"One Visit"
}else if (value == maxValue && value == minValue && visits == maxVisits && visits == minVisits) {
"Random"
}else {
if (visits < maxVisits) {
"Less Visits"
}else {
"More Visits"
}
}
)
Then join the original dataFrame
and the grouped one by id
and add the desired column with the UDF
. Finally remove the undesired columns from the output:
df.join(grouped, Seq("id"))
.withColumn("remarks", remarks($"val", $"visits", $"maxVal", $"maxVisits", $"minVal", $"minVisits", $"count"))
.drop("maxVal","maxVisits", "minVal", "minVisits", "count")
Output:
+---+----+-------+-----------+
| id| val| visits| remarks|
+---+----+-------+-----------+
|112| 4| 2|Less Visits|
|112| 5| 4|More Visits|
|113| 6| 1| One Visit|
|111| 2| 1| Random|
|111| 2| 1| Random|
+---+----+-------+-----------+
P.S. remember to import functions
import org.apache.spark.sql.functions
Upvotes: 2