Reputation: 61
I have two DataFrames. For every row in DF1, I want to cross join it with a random 40% of the data from DF2. For example, the first row of DF1 would be cross joined with a random 40% of DF2; the second row of DF1 would be cross joined with another random 40% of DF2, and so on. Then produce the combined output.
DF1
col1 | col2 |
---|---|
1 | a |
2 | b |
3 | c |
DF2
col1 | col2 |
---|---|
x | kk |
y | zz |
z | mm |
l | gg |
Every DF1 row cross joined with a random 40% of the DF2 rows. Expected output:
col1 | col2 | col3 | col4 |
---|---|---|---|
1 | a | x | kk |
1 | a | l | gg |
2 | b | z | mm |
2 | b | x | kk |
3 | c | x | kk |
3 | c | y | zz |
Upvotes: 0
Views: 221
Reputation: 1962
You could just use `pyspark.sql.DataFrame.sample` to get a 40% sample of DF2 (based on rowcount * 0.4) and then `crossJoin` this onto DF1.
Couple of caveats:

- `sample` includes each row independently with the given fraction, so the result is only *approximately* 40% of DF2, not an exact 40% of the row count.
- A single sample cross joined onto DF1 pairs every DF1 row with the *same* random subset, whereas the expected output above shows a *different* random subset per DF1 row; see the second snippet below for that case.
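A minimal sketch of that approach, assuming PySpark DataFrames built from the example data above (the `toDF("col3", "col4")` rename is only there to match the column names in the expected output):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Example data from the question
df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["col1", "col2"])
df2 = spark.createDataFrame(
    [("x", "kk"), ("y", "zz"), ("z", "mm"), ("l", "gg")], ["col1", "col2"]
).toDF("col3", "col4")  # rename to match the expected output columns

# Approximate 40% sample of DF2, then cross join onto DF1.
# Note: every DF1 row is paired with the SAME sampled subset.
sampled = df2.sample(fraction=0.4, seed=42)
result = df1.crossJoin(sampled)
result.show()
```

If each DF1 row really needs its own independent random 40% of DF2, one common workaround (an assumption about intent, not part of the answer above) is to cross join first and then keep each pair with probability 0.4 via `rand()`:

```python
# Each (DF1, DF2) pair survives independently with probability 0.4,
# so every DF1 row effectively gets its own random ~40% of DF2.
per_row = df1.crossJoin(df2).where(F.rand(seed=42) < 0.4)
per_row.show()
```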
Upvotes: 0