Reputation: 3456
My post title terminology probably isn't correct. But basically, I can do a self-join that only compares ids
with those observed (48-72) hours afterward. I do that with this code:
df.join(df.select(col("id") as "id_b",
col("timestamp") as "timestamp_b"),
(to_timestamp(timestamp).cast(long) + 172800) < (to_timestamp(timestamp_b).cast(long)) &&
(to_timestamp(timestamp).cast(long)) < (to_timestamp(timestamp).cast(long) + 259200) &&
I've read on several SO posts that using the expr()
is a "better" way of doing the above. Reasons were not given but I guess it's supposed to be more legible? Anyway, for the sake of knowledge, how would I convert the above into it's expr()
equivalent? I know how to translate between the two approaches for simpler logic. The above case is beyond me due to the need to convert the timestamp to a long and I'm not sure of the syntax for that (or if it's even possible).
Also, is there a way to break up an expr()
into multiple lines? I tried: expr(""" <logic here> """
but that gives an EOL error.
My timestamps look like:
2020-07-27 02:34:52.3452
2020-10-21 13:23:55.2355
etc.
and are stored in a Spark DataFrame like so:
|-- id: string (nullable = true)
|-- timestamp: string (nullable = true)
Upvotes: 1
Views: 55
Reputation: 42342
Here's an example. Personally I don't find expr
to be more readable in this case, but if you're interested in learning it, you can use the syntax timestamp(col)
or bigint(col)
for type casting in expr
.
df.show(false)
+---+------------------------+
|id |timestamp |
+---+------------------------+
|id1|2020-07-27 13:23:55.2355|
|id2|2020-07-28 13:23:55.2355|
|id3|2020-07-29 13:23:55.2355|
|id4|2020-07-30 13:23:55.2355|
|id5|2020-07-31 13:23:55.2355|
+---+------------------------+
val df2 = df.join(
df.select(col("id") as "id_b", col("timestamp") as "timestamp_b"),
expr("""
bigint(timestamp(timestamp)) + 172800 <= bigint(timestamp(timestamp_b))
and
bigint(timestamp(timestamp)) + 259200 >= bigint(timestamp(timestamp_b))
""")
)
df2.show(false)
+---+------------------------+----+------------------------+
|id |timestamp |id_b|timestamp_b |
+---+------------------------+----+------------------------+
|id1|2020-07-27 13:23:55.2355|id3 |2020-07-29 13:23:55.2355|
|id1|2020-07-27 13:23:55.2355|id4 |2020-07-30 13:23:55.2355|
|id2|2020-07-28 13:23:55.2355|id4 |2020-07-30 13:23:55.2355|
|id2|2020-07-28 13:23:55.2355|id5 |2020-07-31 13:23:55.2355|
|id3|2020-07-29 13:23:55.2355|id5 |2020-07-31 13:23:55.2355|
+---+------------------------+----+------------------------+
Upvotes: 1