Nakeuh
Nakeuh

Reputation: 1939

Dataframe Aggregation

I have a dataframe DF with the following structure :

ID, DateTime, Latitude, Longitude, otherArgs

I want to group my data by ID and time window, and keep information about the location (For example the mean of the grouped latitude and the mean of the grouped longitude)

I successfully got a new dataframe with data grouped by id and time using :

DF.groupBy($"ID",window($"DateTime","2 minutes")).agg(max($"ID"))

But I lose my location data doing that.

What I am looking for is something that would look like this for example:

DF.groupBy($"ID",window($"DateTime","2 minutes"),mean("latitude"),mean("longitude")).agg(max($"ID"))

Returning only one row for each ID and time window.

EDIT :

Sample input : DF : ID, DateTime, Latitude, Longitude, otherArgs

0 , 2018-01-07T04:04:00 , 25.000, 55.000, OtherThings
0 , 2018-01-07T04:05:00 , 26.000, 56.000, OtherThings
1 , 2018-01-07T04:04:00 , 26.000, 50.000, OtherThings
1 , 2018-01-07T04:05:00 , 27.000, 51.000, OtherThings

Sample output : DF : ID, window(DateTime), Latitude, Longitude

0 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 25.5, 55.5
1 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 26.5, 50.5

Upvotes: 1

Views: 62

Answers (2)

koiralo
koiralo

Reputation: 23119

Here is what you can do, you need to use mean with the aggregation.

val df = Seq(
  (0, "2018-01-07T04:04:00", 25.000, 55.000, "OtherThings"),
  (0, "2018-01-07T04:05:00", 26.000, 56.000, "OtherThings"),
  (1, "2018-01-07T04:04:00", 26.000, 50.000, "OtherThings"),
  (1, "2018-01-07T04:05:00", 27.000, 51.000, "OtherThings")
).toDF("ID", "DateTime", "Latitude", "Longitude", "otherArgs")
//convert Sting to DateType for DateTime
.withColumn("DateTime", $"DateTime".cast(DateType))

df.groupBy($"id", window($"DateTime", "2 minutes"))
  .agg(
    mean("Latitude").as("lat"),
    mean("Longitude").as("long")
  )
.show(false)

Output:

+---+---------------------------------------------+----+----+
|id |window                                       |lat |long|
+---+---------------------------------------------+----+----+
|1  |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|26.5|50.5|
|0  |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|25.5|55.5|
+---+---------------------------------------------+----+----+

Upvotes: 1

Tom Lous
Tom Lous

Reputation: 2909

You should use the .agg() method for the aggregating

Perhaps this is what you mean?

DF
  .groupBy(
    'ID,
    window('DateTime, "2 minutes")
  )
  .agg(
    mean("latitude").as("latitudeMean"),
    mean("longitude").as("longitudeMean")        
  )

Upvotes: 1

Related Questions