Sandipan Ghosh

Reputation: 61

PySpark - For every DF1 row, cross join a random 40% of the DF2 rows

I have two DataFrames. For every row in DF1, I want to cross join a random 40% of the rows from DF2. For example, for the first row in DF1, I would take a random 40% of DF2 and cross join it with that row. The same for the second row of DF1, and so on. Then produce the combined output.

DF1

col1  col2
1     a
2     b
3     c

DF2

col1  col2
x     kk
y     zz
z     mm
l     gg

Cross joining every DF1 row with a random 40% of the DF2 rows should produce output like:

col1  col2  col3  col4
1     a     x     kk
1     a     l     gg
2     b     z     mm
2     b     x     kk
3     c     x     kk
3     c     y     zz
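
For reference, a minimal sketch of how these example frames might be constructed (the variable names are assumptions, not part of the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # DF1 and DF2 as shown in the tables above
    df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["col1", "col2"])
    df2 = spark.createDataFrame(
        [("x", "kk"), ("y", "zz"), ("z", "mm"), ("l", "gg")], ["col1", "col2"]
    )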

Upvotes: 0

Views: 221

Answers (1)

Napoleon Borntoparty

Reputation: 1962

You could just use pyspark.sql.DataFrame.sample to get an approximately 40% sample of DF2 (each row is kept with probability 0.4, so the exact row count varies) and then crossJoin it onto DF1; a sketch follows the caveats. A couple of caveats:

  • Every DF1 row gets paired with the same sample (i.e. one random slice, fixed by the given seed)
  • Taking a fresh sample for each DF1 row would be a row-based operation (i.e. you'd likely need a UDF/Pandas UDF), and the performance would be atrocious
  • You should set the seed for sample explicitly to get reproducible results
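
A minimal sketch of this approach, assuming the example data from the question and an arbitrary seed of 42:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Rebuild the example frames from the question; DF2's columns are named
    # col3/col4 here so the joined output matches the layout shown above
    df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["col1", "col2"])
    df2 = spark.createDataFrame(
        [("x", "kk"), ("y", "zz"), ("z", "mm"), ("l", "gg")], ["col3", "col4"]
    )

    # Bernoulli sample: each DF2 row is kept with probability 0.4; fixing
    # the seed makes the slice reproducible across runs
    sampled = df2.sample(fraction=0.4, seed=42)

    # The same sampled slice is cross joined onto every DF1 row
    df1.crossJoin(sampled).show()

Note that the same sampled rows of DF2 appear against every DF1 row, which differs from the per-row sampling in the expected output; getting an independent sample per row is exactly the row-based operation cautioned against above.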

Upvotes: 0
