Reputation: 1939
I have a dataframe DF with the following structure :
ID, DateTime, Latitude, Longitude, otherArgs
I want to group my data by ID and time window, and keep information about the location (For example the mean of the grouped latitude and the mean of the grouped longitude)
I successfully got a new dataframe with data grouped by id and time using :
DF.groupBy($"ID",window($"DateTime","2 minutes")).agg(max($"ID"))
But I lose my location data doing that.
What I am looking for is something that would look like this for example:
DF.groupBy($"ID",window($"DateTime","2 minutes"),mean("latitude"),mean("longitude")).agg(max($"ID"))
Returning only one row for each ID and time window.
EDIT :
Sample input : DF : ID, DateTime, Latitude, Longitude, otherArgs
0 , 2018-01-07T04:04:00 , 25.000, 55.000, OtherThings
0 , 2018-01-07T04:05:00 , 26.000, 56.000, OtherThings
1 , 2018-01-07T04:04:00 , 26.000, 50.000, OtherThings
1 , 2018-01-07T04:05:00 , 27.000, 51.000, OtherThings
Sample output : DF : ID, window(DateTime), Latitude, Longitude
0 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 25.5, 55.5
1 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 26.5, 50.5
Upvotes: 1
Views: 62
Reputation: 23119
Here is what you can do, you need to use mean
with the aggregation
.
val df = Seq(
(0, "2018-01-07T04:04:00", 25.000, 55.000, "OtherThings"),
(0, "2018-01-07T04:05:00", 26.000, 56.000, "OtherThings"),
(1, "2018-01-07T04:04:00", 26.000, 50.000, "OtherThings"),
(1, "2018-01-07T04:05:00", 27.000, 51.000, "OtherThings")
).toDF("ID", "DateTime", "Latitude", "Longitude", "otherArgs")
//convert Sting to DateType for DateTime
.withColumn("DateTime", $"DateTime".cast(DateType))
df.groupBy($"id", window($"DateTime", "2 minutes"))
.agg(
mean("Latitude").as("lat"),
mean("Longitude").as("long")
)
.show(false)
Output:
+---+---------------------------------------------+----+----+
|id |window |lat |long|
+---+---------------------------------------------+----+----+
|1 |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|26.5|50.5|
|0 |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|25.5|55.5|
+---+---------------------------------------------+----+----+
Upvotes: 1
Reputation: 2909
You should use the .agg()
method for the aggregating
Perhaps this is what you mean?
DF
.groupBy(
'ID,
window('DateTime, "2 minutes")
)
.agg(
mean("latitude").as("latitudeMean"),
mean("longitude").as("longitudeMean")
)
Upvotes: 1