Erik Hallin

Reputation: 245

Spark SQL - Handle double quotes in column name

I have a Hive table with a column name similar to: column_"COLUMN_NAME"

My original query is as follows.

spark.sql("SELECT from_unixtime(unix_timestamp(substr(time, 1, 23), 'dd.MM.yyyy HH:mm:ss.SSS')) AS timestamp, cast(regexp_replace(column_"COLUMN_NAME"",',','.') as float) AS Column FROM table")

This, evidently, does not work: the double quotes in the column name terminate the surrounding Python string literal. I've tried various ways to escape the quotation marks in the column name, but neither backslashes nor backticks solve the issue.

Any ideas?

Upvotes: 4

Views: 15474

Answers (2)

pault

Reputation: 43504

You have two options here, but in both cases you need to wrap the column name containing the double quote in backticks.

First, create an example DataFrame:

data = [
    ('01.01.2019 12:34:56.78910', '123,456')
]
df = spark.createDataFrame(data, ['time', 'column_"COLUMN_NAME"'])
df.show(truncate=False)
#+-------------------------+--------------------+
#|time                     |column_"COLUMN_NAME"|
#+-------------------------+--------------------+
#|01.01.2019 12:34:56.78910|123,456             |
#+-------------------------+--------------------+

# register this as a temp table
df.createOrReplaceTempView("table")

Option 1: Use triple quotes:

query = """SELECT 
from_unixtime(unix_timestamp(substr(time, 1, 23), 'dd.MM.yyyy HH:mm:ss.SSS')) AS timestamp, 
cast(regexp_replace(`column_"COLUMN_NAME"`,',','.') as float) AS Column 
FROM table"""
spark.sql(query).show()
#+-------------------+-------+
#|          timestamp| Column|
#+-------------------+-------+
#|2019-01-01 12:34:56|123.456|
#+-------------------+-------+

Option 2: Escape the double quotes:

query = "SELECT from_unixtime(unix_timestamp(substr(time, 1, 23), 'dd.MM.yyyy HH:mm:ss.SSS')) AS timestamp, cast(regexp_replace(`column_\"COLUMN_NAME\"`,',','.') as float) AS Column FROM table"
spark.sql(query).show()
#Same as above
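
As an aside (not part of the original question), the same transformation can be expressed in the DataFrame API, where a bracket lookup takes the column name literally, so no quoting or escaping is needed at all. A minimal sketch against the example DataFrame above:

from pyspark.sql import functions as F

# df['...'] resolves the name directly, so the embedded double quotes are harmless
result = df.select(
    F.from_unixtime(
        F.unix_timestamp(F.substring('time', 1, 23), 'dd.MM.yyyy HH:mm:ss.SSS')
    ).alias('timestamp'),
    F.regexp_replace(df['column_"COLUMN_NAME"'], ',', '.').cast('float').alias('Column'),
)
result.show()
# same output as above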

Upvotes: 2

1pluszara

Reputation: 1528

Try this:

df.show()

+----+--------------------+
|col1|column_"COLUMN_NAME"|
+----+--------------------+
|   1|                 123|
|   2|                 245|
+----+--------------------+

from pyspark.sql import HiveContext
sqlCtx = HiveContext(sc)

df.registerTempTable("table")
qry = """select col1,`column_"COLUMN_NAME"` from table"""
sqlCtx.sql(qry).show()

Output:

+----+--------------------+
|col1|column_"COLUMN_NAME"|
+----+--------------------+
|   1|                 123|
|   2|                 245|
+----+--------------------+
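
For what it's worth, HiveContext and registerTempTable are deprecated as of Spark 2.0. Assuming a SparkSession named spark (as in the other answer), the modern equivalent would be:

# SparkSession replaces HiveContext; createOrReplaceTempView replaces registerTempTable
df.createOrReplaceTempView("table")
spark.sql("""select col1,`column_"COLUMN_NAME"` from table""").show()
# same output as above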

Upvotes: 0
