ashender

Reputation: 45

Apache SparkSQL unable to resolve given input columns that are created within the sqlText

I am trying to run a Spark SQL statement that does a simple group by with an aggregation; however, it complains that it can't find the column month among the input columns I provided in the schema. In the tutorial I am following, the same code runs fine.

Code:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // Schema for the in-memory data: two string columns.
    StructField[] fields = new StructField[]{
            new StructField("level", DataTypes.StringType, false, Metadata.empty()),
            new StructField("datetime", DataTypes.StringType, false, Metadata.empty())
    };

    StructType schema = new StructType(fields);
    Dataset<Row> dateSet = spark.createDataFrame(inMemory, schema);
    dateSet.createOrReplaceTempView("logging_level");
    Dataset<Row> results = spark.sql("select level, date_format(datetime, 'MMMM') as month, count(1) as total from logging_level group by level, month");

Stack trace:

    Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`month`' given input columns: [level, datetime]; line 1 pos 107
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) ...

Upvotes: 1

Views: 151

Answers (1)

GMB

Reputation: 222462

You cannot reuse an alias defined in the select clause in the group by clause. You need to repeat the expression:

    select level, date_format(datetime, 'MMMM') as month, count(*) as total
    from logging_level
    group by level, date_format(datetime, 'MMMM')
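
Plugged back into the Java code from your question, that would look something like this (same dateSet / logging_level setup as above):

    Dataset<Row> results = spark.sql(
            "select level, date_format(datetime, 'MMMM') as month, count(*) as total "
          + "from logging_level "
          + "group by level, date_format(datetime, 'MMMM')");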

Note that I replaced count(1) with count(*): it is more efficient, and gives you the same result.

Many databases also let you reference select columns by position in the group by clause instead of repeating the expression. I think that Spark is one of them, so:

    select level, date_format(datetime, 'MMMM') as month, count(*) as total
    from logging_level
    group by 1, 2
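
For completeness, here is a rough equivalent using the DataFrame API instead of SQL (a sketch assuming the same dateSet from your question), which avoids the alias question entirely:

    import static org.apache.spark.sql.functions.*;

    // Add a "month" column, then group and count; rename the default
    // "count" column to "total" to match the SQL version.
    Dataset<Row> results = dateSet
            .withColumn("month", date_format(col("datetime"), "MMMM"))
            .groupBy("level", "month")
            .count()
            .withColumnRenamed("count", "total");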

Upvotes: 1
