Group by date range overlapping using PySpark

Question

I am trying to perform a kind of grouping with users who have used the same IP in a overlapping range of dates.

This will let me know if two users share the same house because they have the same IP at the same time.

Also, I've been trying to implement it, but I can't find a way to do it with PySpark SQL. In fact, I think it can't be done with PySpark, and probably requires some other graph-oriented library.

The problem is the following:

| ip          | user       | start_date | end_date   |
| ----------- | ---------- | ---------- | ---------- |
| 192.168.1.1 | a          | 2022-01-01 | 2022-01-03 |
| 192.168.1.1 | a          | 2022-01-05 | 2022-01-07 |
| 192.168.1.1 | b          | 2022-01-06 | 2022-01-09 |
| 192.168.1.1 | c          | 2022-01-08 | 2022-01-11 |
| 192.168.1.2 | d          | 2022-01-08 | 2022-01-11 |
| 192.168.1.2 | e          | 2022-01-10 | 2022-01-11 |
| 192.168.1.2 | f          | 2022-01-16 | 2022-01-18 |

As we can see:

the users a, b overlaps in range and same ip.
the users b, c overlaps in range and same ip.
indirectly the users a and c are in the same group.
the users d, e overlaps in range and same ip.
the user f not overlap with respect other user.

Expected output:

| ip          | users       | date_ranges
| ----------- | ----------- | ------------------- | ------------------- |
| 192.168.1.1 | {a, b, c}   | {2022-01-01 - 2022-01-03, 2022-01-05 - 2022-01-07, 2022-01-06 - 2022-01-09, 2022-01-08 - 2022-01-11} |
| 192.168.1.2 | {d, e}      | {2022-01-08 - 2022-01-11, 2022-01-10-2022-01-11} |
| 192.168.1.1 | {f}         | {2022-01-16 - 2022-01-18} |

Do you have any ideas on how to implement this?

I thought about using GraphFrames, but I don't even know where to start :S

Group by date range overlapping using PySpark

Answers (1)

Related Questions