malouke
malouke

Reputation: 569

pyspark join multiple conditions

How I can specify lot of conditions in pyspark when I use .join()

Example : with hive :

query= "select a.NUMCNT,b.NUMCNT as RNUMCNT ,a.POLE,b.POLE as RPOLE,a.ACTIVITE,b.ACTIVITE as RACTIVITE FROM rapexp201412 b \
    join rapexp201412 a where (a.NUMCNT=b.NUMCNT and a.ACTIVITE = b.ACTIVITE and a.POLE =b.POLE  )\

But in PySpark I don't know how to make it because the following:

df_rapexp201412.join(df_aeveh,df_rapexp2014.ACTIVITE==df_rapexp2014.ACTIVITE and df_rapexp2014.POLE==df_aeveh.POLE,'inner')

It does not work!!

Upvotes: 39

Views: 172534

Answers (4)

Devam
Devam

Reputation: 94

What you are looking for is the following

cond = [((df1.col1 == df2.col2) &\
         (df1.col3 == df2.col4))]

joined_df = df1.join(df2, on = cond, how = "inner")

Upvotes: 1

Vzzarr
Vzzarr

Reputation: 5650

Reporting @S V Praveen reply as I had problem to express OR in the join condition:

cond = [df.surname == df3.surname, 
        (df.name == df3.name) | (df.age == df3.age),
        df.orders >= df3.orders ]

df.join(df3, cond, 'outer')

Where | represents the "OR" condition

Upvotes: 9

Mohan
Mohan

Reputation: 897

>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]

This did not work with pyspark 1.3.1. I was getting "AssertionError: joinExprs should be Column"

Instead, I used raw sql to join the data frames as shown below

df.registerTempTable("df")
df3.registerTempTable("df3")

sqlContext.sql("Select df.name,df3.age from df outer join df3 on df.name = df3.name and df.age =df3.age").collect()

Upvotes: 10

user3689574
user3689574

Reputation: 1676

Quoting from spark docs:

(https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join)

join(other, on=None, how=None) Joins with another DataFrame, using the given join expression.

The following performs a full outer join between df1 and df2.

Parameters: other – Right side of the join on – a string for join column name, a list of column names, , a join expression (Column) or a list of Columns. If on is a string or a list of string indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an inner equi-join. how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, semijoin.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
 [Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]


>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]

So you need to use the "condition as a list" option like in the last example.

Upvotes: 60

Related Questions