Reputation: 930
Is there any way to do a group by in Spark SQL that selects multiple columns? The code I am using:
val df = spark.read.json("//path")
df.createOrReplaceTempView("GETBYID")
Now I am doing a group by like this:
val sqlDF = spark.sql(
"SELECT count(customerId) FROM GETBYID group by customerId");
But when I try:
val sqlDF = spark.sql(
"SELECT count(customerId),customerId,userId FROM GETBYID group by customerId");
Spark gives an error:
org.apache.spark.sql.AnalysisException: expression 'getbyid.userId' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
Is there any possible way to do that?
Upvotes: 2
Views: 18502
Reputation: 2621
And if you want to keep all the occurrences of userId, you can do this:
spark.sql("SELECT count(customerId), customerId, collect_list(userId) FROM GETBYID group by customerId")
This works by using collect_list, which gathers every userId in the group into an array.
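For reference, a minimal sketch of the same aggregation with the DataFrame API instead of SQL, assuming the df loaded in the question:

import org.apache.spark.sql.functions.{count, collect_list}

// One row per customerId, with the count and every userId kept in an array.
val grouped = df.groupBy("customerId")
  .agg(count("customerId").as("cnt"), collect_list("userId").as("userIds"))
grouped.show()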
Upvotes: 0
Reputation: 13926
Yes, it's possible, and the error message you attached describes all the possibilities. You can either add the userId to the group by:
val sqlDF = spark.sql("SELECT count(customerId),customerId,userId FROM GETBYID group by customerId, userId");
or use first():
val sqlDF = spark.sql("SELECT count(customerId),customerId,first(userId) FROM GETBYID group by customerId");
Upvotes: 5