Sandipan Ghosh

Reputation: 61

PySpark - For every DF1 row, cross join a random 40% of the DF2 rows

I have two DataFrames. For every row in DF1, I want to cross join a random 40% of the rows from DF2. For example, for the first row in DF1, I would take a random 40% of DF2 and cross join it with that row. The same for the second row of DF1, and so on. Then produce the combined output.

DF1

col1  col2
1     a
2     b
3     c

DF2

col1  col2
x     kk
y     zz
z     mm
l     gg

Cross joining every DF1 row with a random 40% of the DF2 rows should produce output like:

col1  col2  col3  col4
1     a     x     kk
1     a     l     gg
2     b     z     mm
2     b     x     kk
3     c     x     kk
3     c     y     zz
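
For reference, a minimal sketch of how these example frames might be constructed (the variable names are assumptions, not part of the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # DF1 and DF2 as shown in the tables above
    df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["col1", "col2"])
    df2 = spark.createDataFrame(
        [("x", "kk"), ("y", "zz"), ("z", "mm"), ("l", "gg")], ["col1", "col2"]
    )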

Upvotes: 0

Views: 221

Answers (1)

Napoleon Borntoparty

Reputation: 1962

You could just use pyspark.sql.DataFrame.sample to get an approximately 40% sample of DF2 (each row is kept with probability 0.4, so the exact row count varies) and then crossJoin it onto DF1; a sketch follows the caveats. A couple of caveats:

  • Every DF1 row gets paired with the same sample (i.e. one random slice, fixed by the given seed)
  • Taking a fresh sample for each DF1 row would be a row-based operation (i.e. you'd likely need a UDF/Pandas UDF), and the performance would be atrocious
  • You should set the seed for sample explicitly to get reproducible results
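
A minimal sketch of this approach, assuming the example data from the question and an arbitrary seed of 42:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Rebuild the example frames from the question; DF2's columns are named
    # col3/col4 here so the joined output matches the layout shown above
    df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["col1", "col2"])
    df2 = spark.createDataFrame(
        [("x", "kk"), ("y", "zz"), ("z", "mm"), ("l", "gg")], ["col3", "col4"]
    )

    # Bernoulli sample: each DF2 row is kept with probability 0.4; fixing
    # the seed makes the slice reproducible across runs
    sampled = df2.sample(fraction=0.4, seed=42)

    # The same sampled slice is cross joined onto every DF1 row
    df1.crossJoin(sampled).show()

Note that the same sampled rows of DF2 appear against every DF1 row, which differs from the per-row sampling in the expected output; getting an independent sample per row is exactly the row-based operation cautioned against above.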

Upvotes: 0
