PySpark: join DataFrames on comma-separated values in a column

Question

I have two DataFrames that I want to join. The catch is that the second table stores comma-separated values in its deal_id column, and one of those values matches the deal_id column in Table A. How do I do this in PySpark? Below is an example.

Table A has

+-------+--------------------+
|deal_id|           deal_name|
+-------+--------------------+
| 613760|ABCDEFGHI           |
| 613740|TEST123             |
| 598946|OMG                 |
+-------+--------------------+

Table B has

+-----------------------------------+--------------------+
|                            deal_id|           deal_type|
+-----------------------------------+--------------------+
| 613760,613761,613762,613763       |Direct De           |
| 613740,613750,613770,613780,613790|Direct              |
| 598946                            |In                  |
+-----------------------------------+--------------------+

Expected result: join Table A and Table B wherever Table A's deal_id matches one of the comma-separated values in Table B's deal_id. For instance, Table A's deal_id 613760 appears in the first row of Table B, so I want that row returned.

+-------+--------------------+---------------+
|deal_id|           deal_name|      deal_type|
+-------+--------------------+---------------+
| 613760|ABCDEFGHI           |Direct De      |
| 613740|TEST123             |Direct         |
| 598946|OMG                 |In             |
+-------+--------------------+---------------+

Any assistance is appreciated. I need the solution in PySpark.

Thanks.


Answers (1)
