Naveen

Reputation: 19

Join two DataFrames at random in Spark Scala

I have two DataFrames, Events and Address.

Events:

Eent_id |Type|Event_date
AA-XX-BB|SMS |1613693293023
AA-BB-DD|CALL|1613693295039

Address:

Postcode|CityName
RG15NL  |Reading
SL34AD  |Slough

I want to enrich the events dataset by adding the postcode and city values.

As there is no common key between these two sets, I am just looking for a solution that picks a random row from the address file and attaches it to the event file.

Since this is sample data, I am OK with taking any random row from the address file and attaching it to the events file.

Please let me know if there is a way I can achieve this, given that there is no common key between the two datasets.
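
For reference, this is roughly how I build the two sample DataFrames (the column types are an assumption: strings everywhere except Event_date, which I keep as a Long epoch timestamp):

import spark.implicits._

// Sample rows matching the tables above; assumes a SparkSession named `spark`.
val df = Seq(
  ("AA-XX-BB", "SMS", 1613693293023L),
  ("AA-BB-DD", "CALL", 1613693295039L)
).toDF("Eent_id", "Type", "Event_date")

val address_df = Seq(
  ("RG15NL", "Reading"),
  ("SL34AD", "Slough")
).toDF("Postcode", "CityName")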

Upvotes: 0

Views: 249

Answers (1)

blackbishop

Reputation: 32720

If the two DataFrames don't have the same number of rows, you can do a cross join and then pick one address for each Eent_id using row_number over a partition ordered randomly:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rand, row_number}

// df holds the events, address_df the addresses.
// Cross join, then keep one randomly ordered address row per event.
val result = df.crossJoin(address_df).withColumn(
    "rn",
    row_number().over(Window.partitionBy("Eent_id").orderBy(rand()))
  ).filter("rn = 1").drop("rn")

result.show
//+--------+----+-------------+--------+--------+
//| Eent_id|Type|   Event_date|Postcode|CityName|
//+--------+----+-------------+--------+--------+
//|AA-XX-BB| SMS|1613693293023|  SL34AD|  Slough|
//|AA-BB-DD|CALL|1613693295039|  SL34AD|  Slough|
//+--------+----+-------------+--------+--------+
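
Note that the cross join materializes every event/address pair before the filter, so it can get expensive on large inputs. If the two DataFrames did have the same number of rows, one alternative (a sketch, not part of the answer above) would be to pair rows by a generated index instead:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Give each DataFrame a sequential row index, then join on it.
// Spark will warn about the unpartitioned window; fine for small data.
val byIdx = Window.orderBy(monotonically_increasing_id())
val eventsIdx  = df.withColumn("rn", row_number().over(byIdx))
val addressIdx = address_df.withColumn("rn", row_number().over(byIdx))

val paired = eventsIdx.join(addressIdx, Seq("rn")).drop("rn")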

Upvotes: 1
