I'm trying to write a groupBy in Spark with Java. In SQL this would look like:
SELECT id, COUNT(id) AS count, MAX(date) AS maxdate
FROM table
GROUP BY id;
But what is the Spark/Java equivalent of this query? Let's say the variable table
is a DataFrame, to keep the relation to the SQL query clear. I'm thinking of something like:
table = table.select(
        table.col("id"),
        table.col("id").count().as("count"),
        table.col("date").max().as("maxdate"))
    .groupBy("id");
This is obviously incorrect, since you can't call aggregate functions like .count
or .max on columns, only on DataFrames. So how is this done in Spark with Java?
Thank you!
Upvotes: 10
Views: 11076
You could do this with org.apache.spark.sql.functions:
import org.apache.spark.sql.functions;
table.groupBy("id").agg(
    functions.count("id").as("count"),   // number of rows per id
    functions.max("date").as("maxdate")  // latest date per id
).show();
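Alternatively, since you already have the SQL, you can register the DataFrame as a temporary view and run the query directly through the SparkSession. A minimal sketch, assuming your DataFrame is table and your SparkSession is a variable named spark:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assumed names: `table` is the Dataset<Row> from the question,
// `spark` is an existing SparkSession.
table.createOrReplaceTempView("table");
Dataset<Row> result = spark.sql(
    "SELECT id, COUNT(id) AS count, MAX(date) AS maxdate " +
    "FROM table GROUP BY id");
result.show();

Both give the same result; the agg version just keeps everything in the DataFrame API instead of embedding a SQL string.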
Upvotes: 26