samol

Reputation: 20590

PySpark: How to select * from the left table during rdd join

How do I select * from the left table only in a PySpark join?

impression_rdd.join(
        click_rdd,
        impression_rdd.session_id == click_rdd.session_id,
        "left_outer"
    ).select(impression_rdd.*)  # pseudocode: how do you do this?

Basically, I want the equivalent of this SQL:

SELECT impression.* FROM impression LEFT JOIN click ON (impression.session_id = click.session_id)

Upvotes: 4

Views: 1577

Answers (2)

jcomeau_ictx

Reputation: 38452

Two other constructs that give the same result as zero323's answer, provided session_id is the only column name the two tables share:

(impressions.join(clicks, 'session_id', 'left_outer')
    .select(*impressions.columns))

And if the right-hand table contributes only one extra column, say 'count', dropping it might be more readable:

(impressions.join(clicks, 'session_id', 'left_outer')
    .drop('count'))

Upvotes: 1

zero323

Reputation: 330203

You can simply add an alias and a couple of quotes to your pseudocode:

(impressions.alias("impressions")
    .join(clicks, ["session_id"], "left_outer")
    .select("impressions.*"))

Upvotes: 2
