Joon

Reputation: 189

PySpark and SQL: join and null values

I have a PySpark dataframe consisting of two columns, named input and target. It is the crossJoin of two single-column dataframes. Below is an example of what such a dataframe looks like.

input  target
A      Voigt.
A      Leica
A      Zeiss
B      Voigt.
B      Leica
B      Zeiss
C      Voigt.
C      Leica
C      Zeiss
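
For reference, such a cross join could be produced as below (a minimal sketch; the variable names inputs and targets are illustrative, not from my actual code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative single-column dataframes
inputs = spark.createDataFrame([("A",), ("B",), ("C",)], ["input"])
targets = spark.createDataFrame([("Voigt.",), ("Leica",), ("Zeiss",)], ["target"])

# Cartesian product: one row per input-target pair
input_target = inputs.crossJoin(targets)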

Then I have another dataframe which provides a number describing the relation between the input and target columns. However, it is not guaranteed that every input-target pair has this numerical value. For example, A - Voigt. may have 2 as its relational value, but A - Leica may not have a value at all. Below is an example.

input  target  val
A      Voigt.  2
A      Zeiss   1
B      Leica   3
C      Zeiss   5
C      Leica   2

Now I want a dataframe that combines these two and looks like this.

input  target  val
A      Voigt.  2
A      Leica   null
A      Zeiss   1
B      Voigt.  null
B      Leica   3
B      Zeiss   null
C      Voigt.  null
C      Leica   2
C      Zeiss   5

I tried a left join of these two dataframes and then tried to filter, but I've had trouble completing it in this form.

result = input_target.join(input_target_w_val, (input_target.input == input_target_w_val.input) & (input_target.target == input_target_w_val.target), 'left')

How should I apply a filter from this point, or is there another way I can achieve this?
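
One complication worth noting: because the join condition uses column expressions, the joined result carries duplicate input and target columns from both sides. A minimal sketch of completing it from that point, keeping the variable names above:

result = input_target.join(
    input_target_w_val,
    (input_target.input == input_target_w_val.input)
    & (input_target.target == input_target_w_val.target),
    'left',
)

# Select qualified columns so the duplicated join keys are dropped
result = result.select(input_target.input, input_target.target, input_target_w_val.val)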

Upvotes: 0

Views: 835

Answers (3)

过过招

Reputation: 4199

Some plagiarized modifications are completely silent. This isn't the first or second time, and you take pride in constantly copying other people's work, don't you?

Forgive me for another post, but I'm really angry; this kind of behavior has seriously disturbed the atmosphere of the forum.

Upvotes: 0

过过招

Reputation: 4199

You can simply specify a list of join column names.

df = df1.join(df2, ['input', 'target'], 'left')
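
Applied to the question's dataframes, using the question's own variable names, this keeps a single copy of the join columns and leaves val as null where no match exists:

result = input_target.join(input_target_w_val, ['input', 'target'], 'left')
result.orderBy('input', 'target').show()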

Upvotes: 1

Dipanjan Mallick

Reputation: 1739

Try it as below:

Input DataFrames

df1 = spark.createDataFrame(
    data=[("A", "Voigt."), ("A", "Leica"), ("A", "Zeiss"),
          ("B", "Voigt."), ("B", "Leica"), ("B", "Zeiss"),
          ("C", "Voigt."), ("C", "Leica"), ("C", "Zeiss")],
    schema=["input", "target"],
)
df1.show()

+-----+------+
|input|target|
+-----+------+
|    A|Voigt.|
|    A| Leica|
|    A| Zeiss|
|    B|Voigt.|
|    B| Leica|
|    B| Zeiss|
|    C|Voigt.|
|    C| Leica|
|    C| Zeiss|
+-----+------+

df2 = spark.createDataFrame(
    data=[("A", "Voigt.", 2), ("A", "Zeiss", 1), ("B", "Leica", 3),
          ("C", "Zeiss", 5), ("C", "Leica", 2)],
    schema=["input", "target", "val"],
)
df2.show()

+-----+------+---+
|input|target|val|
+-----+------+---+
|    A|Voigt.|  2|
|    A| Zeiss|  1|
|    B| Leica|  3|
|    C| Zeiss|  5|
|    C| Leica|  2|
+-----+------+---+

Required Output

df1.join(df2, on=["input", "target"], how="left_outer") \
   .select(df1["input"], df1["target"], df2["val"]) \
   .show(truncate=False)

+-----+------+----+
|input|target|val |
+-----+------+----+
|A    |Leica |null|
|A    |Voigt.|2   |
|A    |Zeiss |1   |
|B    |Leica |3   |
|B    |Voigt.|null|
|B    |Zeiss |null|
|C    |Leica |2   |
|C    |Voigt.|null|
|C    |Zeiss |5   |
+-----+------+----+
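
Since the question also asks about filtering from this point, the unmatched pairs can be isolated afterwards with a null filter (a minimal sketch):

from pyspark.sql import functions as F

joined = df1.join(df2, on=["input", "target"], how="left_outer")

# Keep only the input-target pairs that have no relational value
joined.filter(F.col("val").isNull()).show()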

Upvotes: 1
