sriks
sriks

Reputation: 95

SparkSQL "CASE WHEN THEN" with two table columns in pyspark

I have two temporary tables table_a and table_b and trying to get this query and all of its conditions to work properly.

SELECT DISTINCT CASE WHEN a.id IS NULL THEN b.id ELSE a.id END id,
    CASE WHEN a.num IS NULL THEN b.num ELSE a.num END num,
    CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END testdate
FROM table_a a
    FULL OUTER JOIN table_b b
    ON (a.id=b.id AND a.num=b.num AND a.testdate=b.testdate)
WHERE
    (CASE WHEN a.t_amt IS NULL THEN 0 ELSE a.t_amt END)
    <>
    (CASE WHEN b.t_amt IS NULL THEN 0 ELSE b.t_amt END) OR
    (CASE WHEN a.qty IS NULL THEN 0 ELSE a.qty END)
    <>
    (CASE WHEN b.qty IS NULL THEN 0 ELSE b.qty END)
ORDER BY
    CASE WHEN a.id IS NULL THEN b.id ELSE a.id END,
    CASE WHEN a.num IS NULL THEN b.num ELSE a.num END,
    CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END

Executing the above query on these two tables using SparkSQL produces the following error

sqlq = <the sql from above>

df = sqlContext.sql(sqlq)

"AnalysisException: u"cannot resolve 'a.id' given input columns: [id, num, testdate];"

Upvotes: 1

Views: 2412

Answers (1)

Flygenring
Flygenring

Reputation: 3848

It seems your error is in the ORDER BY clause, as it has no concept of tables a and b, but only the names and aliases in the SELECT clause.
This makes quite good sense since you should really only order your results based on columns actually in the result set.

SELECT DISTINCT (CASE WHEN a.id IS NULL THEN b.id ELSE a.id END) AS id,
    (CASE WHEN a.num IS NULL THEN b.num ELSE a.num END) AS num,
    (CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END) AS testdate
FROM table_a AS a
    FULL OUTER JOIN table_b AS b
    ON (a.id=b.id AND a.num=b.num AND a.testdate=b.testdate)
WHERE
    (CASE WHEN a.t_amt IS NULL THEN 0 ELSE a.t_amt END) <> (CASE WHEN b.t_amt IS NULL THEN 0 ELSE b.t_amt END)
    OR
    (CASE WHEN a.qty IS NULL THEN 0 ELSE a.qty END) <> (CASE WHEN b.qty IS NULL THEN 0 ELSE b.qty END)
ORDER BY id, num, testdate

Upvotes: 1

Related Questions