user2205916
user2205916

Reputation: 3456

Scala: expressions vs method/function comparison

My post title terminology probably isn't correct. But basically, I can do a self-join that only compares ids with those observed (48-72) hours afterward. I do that with this code:

df.join(df.select(col("id") as "id_b",
                  col("timestamp") as "timestamp_b"),
           (to_timestamp(timestamp).cast(long) + 172800) < (to_timestamp(timestamp_b).cast(long)) && 
           (to_timestamp(timestamp).cast(long)) < (to_timestamp(timestamp).cast(long) + 259200) && 

I've read on several SO posts that using the expr() is a "better" way of doing the above. Reasons were not given but I guess it's supposed to be more legible? Anyway, for the sake of knowledge, how would I convert the above into it's expr() equivalent? I know how to translate between the two approaches for simpler logic. The above case is beyond me due to the need to convert the timestamp to a long and I'm not sure of the syntax for that (or if it's even possible).

Also, is there a way to break up an expr() into multiple lines? I tried: expr(""" <logic here> """ but that gives an EOL error.

My timestamps look like:

2020-07-27 02:34:52.3452
2020-10-21 13:23:55.2355
etc.

and are stored in a Spark DataFrame like so:

 |-- id: string (nullable = true)
 |-- timestamp: string (nullable = true)

Upvotes: 1

Views: 55

Answers (1)

mck
mck

Reputation: 42342

Here's an example. Personally I don't find expr to be more readable in this case, but if you're interested in learning it, you can use the syntax timestamp(col) or bigint(col) for type casting in expr.

df.show(false)
+---+------------------------+
|id |timestamp               |
+---+------------------------+
|id1|2020-07-27 13:23:55.2355|
|id2|2020-07-28 13:23:55.2355|
|id3|2020-07-29 13:23:55.2355|
|id4|2020-07-30 13:23:55.2355|
|id5|2020-07-31 13:23:55.2355|
+---+------------------------+

val df2 = df.join(
    df.select(col("id") as "id_b", col("timestamp") as "timestamp_b"),
    expr("""
        bigint(timestamp(timestamp)) + 172800 <= bigint(timestamp(timestamp_b)) 
        and 
        bigint(timestamp(timestamp)) + 259200 >= bigint(timestamp(timestamp_b))
    """)
)

df2.show(false)
+---+------------------------+----+------------------------+
|id |timestamp               |id_b|timestamp_b             |
+---+------------------------+----+------------------------+
|id1|2020-07-27 13:23:55.2355|id3 |2020-07-29 13:23:55.2355|
|id1|2020-07-27 13:23:55.2355|id4 |2020-07-30 13:23:55.2355|
|id2|2020-07-28 13:23:55.2355|id4 |2020-07-30 13:23:55.2355|
|id2|2020-07-28 13:23:55.2355|id5 |2020-07-31 13:23:55.2355|
|id3|2020-07-29 13:23:55.2355|id5 |2020-07-31 13:23:55.2355|
+---+------------------------+----+------------------------+

Upvotes: 1

Related Questions