Béatrice M.

Reputation: 972

Scala Spark - Map function referencing another dataframe

I have two dataframes:

df1:

+---+------+----+
| id|weight|time|
+---+------+----+
|  A|   0.1|   1|
|  A|   0.2|   2|
|  A|   0.3|   4|
|  A|   0.4|   5|
|  B|   0.5|   1|
|  B|   0.7|   3|
|  B|   0.8|   6|
|  B|   0.9|   7|
|  B|   1.0|   8|
+---+------+----+

df2:

+---+---+-------+-----+
| id|  t|t_start|t_end|
+---+---+-------+-----+
|  A| t1|      0|    3|
|  A| t2|      4|    6|
|  A| t3|      7|    9|
|  B| t1|      0|    2|
|  B| t2|      3|    6|
|  B| t3|      7|    9|
+---+---+-------+-----+

My desired output is to identify the 't' for each time stamp in df1, where the ranges of 't' are in df2.

df_output:

+---+------+----+---+
| id|weight|time| t |
+---+------+----+---+
|  A|   0.1|   1| t1|
|  A|   0.2|   2| t1|
|  A|   0.3|   4| t2|
|  A|   0.4|   5| t2|
|  B|   0.5|   1| t1|
|  B|   0.7|   3| t2|
|  B|   0.8|   6| t2|
|  B|   0.9|   7| t3|
|  B|   1.0|   8| t3|
+---+------+----+---+

My understanding so far is that I would need to create a UDF that takes the columns 'id and 'time as inputs and, for each row, looks up df2.filter(df2("id") === df1("id") && df1("time") >= df2("t_start") && df1("time") <= df2("t_end")) to get the corresponding df2("t").

I'm very new to Scala and Spark, so I am wondering whether this approach is even possible.
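
For reference, here is a minimal sketch of how the sample data above could be constructed (assuming a SparkSession named spark; the real dataframes come from my actual data):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("range-join-example").getOrCreate()
import spark.implicits._

// sample data matching the tables above
val df1 = Seq(
  ("A", 0.1, 1), ("A", 0.2, 2), ("A", 0.3, 4), ("A", 0.4, 5),
  ("B", 0.5, 1), ("B", 0.7, 3), ("B", 0.8, 6), ("B", 0.9, 7), ("B", 1.0, 8)
).toDF("id", "weight", "time")

val df2 = Seq(
  ("A", "t1", 0, 3), ("A", "t2", 4, 6), ("A", "t3", 7, 9),
  ("B", "t1", 0, 2), ("B", "t2", 3, 6), ("B", "t3", 7, 9)
).toDF("id", "t", "t_start", "t_end")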

Upvotes: 2

Views: 1331

Answers (1)

zero323

Reputation: 330413

You cannot reference another DataFrame from inside a UDF, but all you have to do is reuse the filter condition you already defined to join both frames:

df1.join(
  df2,
  df2("id") === df1("id") && df1("time").between(df2("t_start"), df2("t_end"))
)
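
If you want exactly the df_output shown in the question, a possible follow-up (a sketch, assuming the column names above) is to select only df1's columns plus t after the join; df1("id") disambiguates between the two id columns the join produces:

val df_output = df1.join(
    df2,
    df2("id") === df1("id") && df1("time").between(df2("t_start"), df2("t_end"))
  )
  // keep only the columns from the desired output
  .select(df1("id"), df1("weight"), df1("time"), df2("t"))

df_output.show()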

Upvotes: 2
