Aravind Yarram

Reputation: 80176

Spark SQL: Error when using aliased column in SELECT clause

I have ORDERS and SELLERS tables with the following schema:

scala> spark.sql("desc ORDERS").show()
+---------------+---------+-------+
|       col_name|data_type|comment|
+---------------+---------+-------+
|       order_id|   string|   null|
|     product_id|   string|   null|
|      seller_id|   string|   null|
|           date|   string|   null|
|num_pieces_sold|   string|   null|
|  bill_raw_text|   string|   null|
+---------------+---------+-------+


scala> spark.sql("desc SELLERS").show()
+------------+---------+-------+
|    col_name|data_type|comment|
+------------+---------+-------+
|   seller_id|   string|   null|
| seller_name|   string|   null|
|daily_target|   string|   null|
+------------+---------+-------+

I am getting an error with the following query. I understand I can avoid this with a subquery (or withColumn, sketched after the error output below), but I am wondering whether there is an alternative.

scala> 
q3: String =
"
 SELECT
    o.seller_id,
    o.num_pieces_sold/s.daily_target AS ratio,
    avg(ratio) AS percent_contribution
 FROM orders o
 JOIN sellers s ON o.seller_id = s.seller_id
 GROUP BY o.seller_id
 "

scala> spark.sql(q3).show()
org.apache.spark.sql.AnalysisException: cannot resolve '`ratio`' given input columns: [o.bill_raw_text, s.seller_id, o.date, o.seller_id, o.order_id, o.product_id, s.seller_name, o.num_pieces_sold, s.daily_target]; line 5 pos 6;
'Aggregate [seller_id#8], [seller_id#8, (cast(num_pieces_sold#10 as double) / cast(daily_target#20 as double)) AS ratio#421, 'avg('ratio) AS percent_contribution#422]
+- Join Inner, (seller_id#8 = seller_id#18)
   :- SubqueryAlias `o`
   :  +- SubqueryAlias `orders`
   :     +- Relation[order_id#6,product_id#7,seller_id#8,date#9,num_pieces_sold#10,bill_raw_text#11] parquet
   +- SubqueryAlias `s`
      +- SubqueryAlias `sellers`
         +- Relation[seller_id#18,seller_name#19,daily_target#20] parquet
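
For reference, here is a sketch of the subquery workaround I mentioned: the ratio is computed in an inner query so the outer query can reference the alias (the name q3Sub is just for illustration; tables and columns are the ones from the schemas above).

val q3Sub = """
  SELECT seller_id, avg(ratio) AS percent_contribution
  FROM (
    SELECT
      o.seller_id,
      o.num_pieces_sold/s.daily_target AS ratio
    FROM orders o
    JOIN sellers s ON o.seller_id = s.seller_id
  ) t
  GROUP BY seller_id
"""

spark.sql(q3Sub).show()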

Upvotes: 1

Views: 1336

Answers (1)

Ram Ghadiyaram

Reputation: 29165

It seems Catalyst, by design, does not resolve the intermediate expression's alias (ratio) when it is referenced again in another column of the same SELECT list, as one might expect from plain SQL.

Direct use of the expression inside the avg function works, as shown below, without aliasing it as ratio:

package examples

import org.apache.log4j.Level
import org.apache.spark.sql.{SaveMode, SparkSession}

object Ratio extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark: SparkSession = SparkSession.builder.appName(getClass.getName).enableHiveSupport()
    .master("local[*]").getOrCreate

  import spark._
  import spark.implicits._


  // Create the ORDERS table (the saveAsTable below overwrites it with sample data)
  sql(
    """
      |CREATE TABLE IF NOT EXISTS orders
      |(
      |  order_id string
      |, product_id String
      |, seller_id String
      |, date int
      |, num_pieces_sold String
      |, bill_raw_text String
      |)
    """.stripMargin)


  // Create the SELLERS table
  sql(
    """
      |CREATE TABLE IF NOT EXISTS sellers
      |(
      |  seller_id string
      |, seller_name String
      |, daily_target String )
    """.stripMargin)

  sql("SHOW tables ").show

  // Sample ORDERS rows; SaveMode.Overwrite replaces the table contents and schema
  val df1 = Seq(
    ("1", "1", "1", 20200520, 10, "rawtext"),
    ("2", "2", "2", 20200521, 11, "rawtext1")
  ).toDF("order_id", "product_id", "seller_id", "date", "num_pieces_sold", "bill_raw_text")

  df1.write.mode(SaveMode.Overwrite).saveAsTable("orders")
  df1.show

  // Sample SELLERS rows
  val df2 = Seq(
    ("1", "seller_name1", "11"),
    ("2", "seller_name2", "22")
  ).toDF("seller_id", "seller_name", "daily_target")

  df2.write.mode(SaveMode.Overwrite).saveAsTable("sellers")
  df2.show


  // Aggregate over the expression directly instead of referencing an alias
  val q3 =
    """
      | SELECT
      |    o.seller_id,
      |    avg(o.num_pieces_sold/ s.daily_target) AS percent_contribution
      | FROM orders o
      | JOIN sellers s ON o.seller_id = s.seller_id
      |  GROUP BY o.seller_id
    """.stripMargin

  spark.sql(q3).show


}


Result:

+--------+-------------+-----------+
|database|    tableName|isTemporary|
+--------+-------------+-----------+
| default|       orders|      false|
| default|      sellers|      false|
+--------+-------------+-----------+

20/05/18 21:08:29 INFO log: Updating table stats fast for orders
20/05/18 21:08:29 INFO log: Updated size of table orders to 2949
+--------+----------+---------+--------+---------------+-------------+
|order_id|product_id|seller_id|    date|num_pieces_sold|bill_raw_text|
+--------+----------+---------+--------+---------------+-------------+
|       1|         1|        1|20200520|             10|      rawtext|
|       2|         2|        2|20200521|             11|     rawtext1|
+--------+----------+---------+--------+---------------+-------------+


+---------+------------+------------+
|seller_id| seller_name|daily_target|
+---------+------------+------------+
|        1|seller_name1|          11|
|        2|seller_name2|          22|
+---------+------------+------------+

+---------+--------------------+
|seller_id|percent_contribution|
+---------+--------------------+
|        1|  0.9090909090909091|
|        2|                 0.5|
+---------+--------------------+
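
If you still want ratio available as a named column, the withColumn route from the question also works. A minimal DataFrame API sketch, assuming the same orders and sellers tables as above:

import org.apache.spark.sql.functions.{avg, col}

// Alias the ratio first, then aggregate over the named column
val joined = spark.table("orders").as("o")
  .join(spark.table("sellers").as("s"), col("o.seller_id") === col("s.seller_id"))
  .withColumn("ratio", col("o.num_pieces_sold") / col("s.daily_target"))

joined
  .select(col("o.seller_id").as("seller_id"), col("ratio"))
  .groupBy("seller_id")
  .agg(avg("ratio").as("percent_contribution"))
  .show()

This resolves because withColumn materializes ratio on the joined plan before the aggregation runs, so the aggregate sees it as an ordinary column.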

Upvotes: 3
