Reputation: 19
I have two DataFrames, Events and Address.
Events:
Eent_id |Type|Event_date
AA-XX-BB|SMS |1613693293023
AA-BB-DD|CALL|1613693295039
Address:
Postcode|CityName
RG15NL |Reading
SL34AD |Slough
I want to enrich the Events dataset by adding the postcode and city values from Address.
Since there is no common key between the two datasets, and this is only sample data, I am fine with picking a random row from the Address DataFrame and attaching it to each row of Events.
Is there a way to achieve this given that the two datasets share no key?
Upvotes: 0
Views: 249
Reputation: 32720
If the two DataFrames don't have the same number of rows, you can try a cross join, then pick one address for each Eent_id using row_number over a window partitioned by event and ordered randomly:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rand, row_number}

// Cross join every event with every address, then keep one
// randomly chosen address per event.
val result = df.crossJoin(address_df)
  .withColumn(
    "rn",
    row_number().over(Window.partitionBy("Eent_id").orderBy(rand()))
  )
  .filter("rn = 1")
  .drop("rn")

result.show
//+--------+----+-------------+--------+--------+
//| Eent_id|Type| Event_date|Postcode|CityName|
//+--------+----+-------------+--------+--------+
//|AA-XX-BB| SMS|1613693293023| SL34AD| Slough|
//|AA-BB-DD|CALL|1613693295039| SL34AD| Slough|
//+--------+----+-------------+--------+--------+
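For a self-contained test, the input DataFrames the snippet assumes could be built like this; the names df and address_df, plus the use of toDF on a local Seq inside a spark-shell-style session, are my assumptions rather than part of the original post:

import spark.implicits._

// Sample rows copied from the question; df holds the events and
// address_df holds the addresses, matching the names used above.
val df = Seq(
  ("AA-XX-BB", "SMS", 1613693293023L),
  ("AA-BB-DD", "CALL", 1613693295039L)
).toDF("Eent_id", "Type", "Event_date")

val address_df = Seq(
  ("RG15NL", "Reading"),
  ("SL34AD", "Slough")
).toDF("Postcode", "CityName")

Note that orderBy(rand()) shuffles the joined addresses independently within each Eent_id partition, so rn = 1 selects one address at random per event (which is why both events can end up with the same address, as in the output above). With a large address table, the cross join materializes events x addresses rows before filtering, which can get expensive.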
Upvotes: 1