Reputation: 45
I am trying to run a Spark SQL statement that does a simple group by with an aggregation; however, it complains that it can't find the column month among the input columns, even though I provided a schema. I am following a tutorial in which the same code ran fine.
Code:
StructField[] fields = new StructField[]{
        new StructField("level", DataTypes.StringType, false, Metadata.empty()),
        new StructField("datetime", DataTypes.StringType, false, Metadata.empty())
};
StructType schema = new StructType(fields);
Dataset<Row> dateSet = spark.createDataFrame(inMemory, schema);
dateSet.createOrReplaceTempView("logging_level");
Dataset<Row> results = spark.sql("select level, date_format(datetime, 'MMMM') as month, count(1) as total from logging_level group by level, month");
Stack trace:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`month`' given input columns: [level, datetime]; line 1 pos 107
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)....
Upvotes: 1
Views: 151
Reputation: 222462
You cannot reuse an alias defined in the select clause in the group by clause. You need to repeat the expression:
select level, date_format(datetime, 'MMMM') as month, count(*) as total
from logging_level
group by level, date_format(datetime, 'MMMM')
Note that I replaced count(1) with count(*): it gives you the same result, and in some databases it is also slightly more efficient.
Many databases also let you refer to select items by their ordinal position in the group by clause. Spark is one of them (controlled by spark.sql.groupByOrdinal, enabled by default), so this works too:
select level, date_format(datetime, 'MMMM') as month, count(*) as total
from logging_level
group by 1, 2
Upvotes: 1