Reputation: 189
I have a pyspark dataframe consisting of two columns named input and target. It is the crossJoin of two single-column dataframes. Below is an example of what such a dataframe looks like.
input | target |
---|---|
A | Voigt. |
A | Leica |
A | Zeiss |
B | Voigt. |
B | Leica |
B | Zeiss |
C | Voigt. |
C | Leica |
C | Zeiss |
Then I have another dataframe that provides a number describing the relation between the input and target columns. However, it is not guaranteed that every input-target pair has this numerical value. For example, A - Voigt. may have 2 as its relational value, but A - Leica may not have a value at all. Below is an example.
input | target | val |
---|---|---|
A | Voigt. | 2 |
A | Zeiss | 1 |
B | Leica | 3 |
C | Zeiss | 5 |
C | Leica | 2 |
Now I want a dataframe that combines these two, like this:
input | target | val |
---|---|---|
A | Voigt. | 2 |
A | Leica | null |
A | Zeiss | 1 |
B | Voigt. | null |
B | Leica | 3 |
B | Zeiss | null |
C | Voigt. | null |
C | Leica | 2 |
C | Zeiss | 5 |
I tried a left join of these two dataframes and then tried to filter the result, but I haven't been able to get it into this form.
result = input_target.join(input_target_w_val, (input_target.input == input_target_w_val.input) & (input_target.target == input_target_w_val.target), 'left')
How should I put a filter from this point, or is there another way I can achieve this?
Upvotes: 0
Views: 835
Reputation: 4199
Some plagiarized modifications are made completely silently. And not just once or twice; you take pride in constantly copying other people's work, don't you?
Forgive me for another post; I'm really annoyed. This kind of behavior has seriously disturbed the atmosphere of the forum.
Upvotes: 0
Reputation: 4199
You can simply specify a list of join column names:
df = df1.join(df2, ['input', 'target'], 'left')
Upvotes: 1
Reputation: 1739
Try it as below:
Input DataFrames
df1 = spark.createDataFrame(data=[("A", "Voigt."), ("A", "Leica"), ("A", "Zeiss"), ("B", "Voigt."), ("B", "Leica"), ("B", "Zeiss"), ("C", "Voigt."), ("C", "Leica"), ("C", "Zeiss")], schema=["input", "target"])
df1.show()
+-----+------+
|input|target|
+-----+------+
| A|Voigt.|
| A| Leica|
| A| Zeiss|
| B|Voigt.|
| B| Leica|
| B| Zeiss|
| C|Voigt.|
| C| Leica|
| C| Zeiss|
+-----+------+
df2 = spark.createDataFrame(data=[("A", "Voigt.", 2), ("A", "Zeiss", 1), ("B", "Leica", 3), ("C", "Zeiss", 5), ("C", "Leica", 2)], schema=["input", "target", "val"])
df2.show()
+-----+------+---+
|input|target|val|
+-----+------+---+
| A|Voigt.| 2|
| A| Zeiss| 1|
| B| Leica| 3|
| C| Zeiss| 5|
| C| Leica| 2|
+-----+------+---+
Required Output
df1.join(df2, on = ["input", "target"], how = "left_outer").select(df1["input"], df1["target"], df2["val"]).show(truncate=False)
+-----+------+----+
|input|target|val |
+-----+------+----+
|A |Leica |null|
|A |Voigt.|2 |
|A |Zeiss |1 |
|B |Leica |3 |
|B |Voigt.|null|
|B |Zeiss |null|
|C |Leica |2 |
|C |Voigt.|null|
|C |Zeiss |5 |
+-----+------+----+
Upvotes: 1