Jeston

Reputation: 396

pyspark: select column using an alias

I'm trying to do a simple select from a table alias using SQLContext.sql in Spark 1.6.

from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType

sqlCtx = SQLContext(sc)
## Import CSV File
header = (sc.textFile("data.csv")
          .map(lambda line: line.split(",")))

## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'desc'])

## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))

headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.desc from headerTab as d").show()

I have noticed that this seems to work in Spark 2.0, but I'm limited to 1.6 at the moment.

This is the error message I am seeing. For a simple select I could remove the alias, but ultimately I am trying to do a join with multiple tables that have the same column names.

Spark 1.6 Error

Traceback (most recent call last):
  File "/home/temp/text_import.py", line 49, in <module>
    head = sqlCtx.sql("select d.desc from headerTab as d").show()
  File "/home/pricing/spark-1.6.1/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql
  File "/home/pricing/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/home/pricing/spark-1.6.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/home/pricing/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o64.sql.
: java.lang.RuntimeException: [1.10] failure: ``*'' expected but `desc' found

Spark 2.0 Returns

+--------+
|    desc|
+--------+
|    data|
|    data|
|    data|
+--------+

Upvotes: 3

Views: 5469

Answers (2)

Rakesh Kumar

Reputation: 4420

The problem is that desc is a reserved SQL keyword, so it can't be used as a bare column name. You can resolve this in two ways: either rename the column, or wrap the keyword desc in backticks (`).

Way 1:-

from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType

sqlCtx = SQLContext(sc)
## Import CSV File
header = (sc.textFile("data.csv")
          .map(lambda line: line.split(",")))

## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'description'])

## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))

headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.description from headerTab as d").show()

Way 2:-

from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType

sqlCtx = SQLContext(sc)
## Import CSV File
header = (sc.textFile("data.csv")
          .map(lambda line: line.split(",")))

## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'desc'])

## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))

headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.`desc` from headerTab as d").show()

Upvotes: 3

Jeston

Reputation: 396

As stated in the comment below the question, the use of desc is inappropriate because it's a reserved keyword. Renaming the column fixes the issue.

## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'descTmp'])

## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))

headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.descTmp from headerTab as d").show()

+-----------+
|    descTmp|
+-----------+
|       data|
|       data|
|       data|
+-----------+
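This rename also covers the join scenario from the question: once each table's columns are either uniquely named or qualified by a table alias, same-named columns no longer collide. A hedged sketch, again using stdlib sqlite3 as a stand-in for SparkSQL (table names and data here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE headerTab (header TEXT, descTmp TEXT)")
conn.execute("CREATE TABLE otherTab  (header TEXT, descTmp TEXT)")
conn.execute("INSERT INTO headerTab VALUES ('h1', 'left data')")
conn.execute("INSERT INTO otherTab  VALUES ('h1', 'right data')")

# Aliases a/b qualify the identically named descTmp columns on each side.
rows = conn.execute("""
    SELECT a.descTmp, b.descTmp
    FROM headerTab AS a
    JOIN otherTab AS b ON a.header = b.header
""").fetchall()
print(rows)  # [('left data', 'right data')]
```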

Upvotes: 1
