I'm trying to write a groupBy in Spark with Java. In SQL this would look like:
SELECT id, COUNT(id) AS count, MAX(date) AS maxdate
FROM table
GROUP BY id;
But what is the Spark/Java equivalent of this query? Let's say the variable table
is a DataFrame, to keep the relation to the SQL query clear. I'm thinking of something like:
table = table.select(
        table.col("id"),
        table.col("id").count().as("count"),
        table.col("date").max().as("maxdate"))
    .groupBy("id");
This is obviously incorrect, since you can't call aggregate functions like .count
or .max on columns, only on DataFrames. So how is this done in Spark with Java?
Thank you!
Upvotes: 10
Views: 11076
You could do this with org.apache.spark.sql.functions:
import org.apache.spark.sql.functions;
table.groupBy("id").agg(
    functions.count("id").as("count"),   // number of rows per id
    functions.max("date").as("maxdate")  // latest date per id
).show();
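Alternatively, since you already have the SQL, you can register the DataFrame as a temporary view and run the query directly through the SparkSession. A minimal sketch, assuming your DataFrame is table and your SparkSession is a variable named spark:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assumed names: `table` is the Dataset<Row> from the question,
// `spark` is an existing SparkSession.
table.createOrReplaceTempView("table");
Dataset<Row> result = spark.sql(
    "SELECT id, COUNT(id) AS count, MAX(date) AS maxdate " +
    "FROM table GROUP BY id");
result.show();

Both give the same result; the agg version just keeps everything in the DataFrame API instead of embedding a SQL string.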
Upvotes: 26