How to expand a time range into per-minute intervals in Spark (Scala or Python)?

Question

I have a dataset with the following structure:

+-------+----------+---------------+---------------+
| tv_id | movie_id |  start_time   |   end_time    |
+-------+----------+---------------+---------------+
| tv123 | movie123 | 02/05/19 3:05 | 02/05/19 3:08 |
| tv234 | movie345 | 02/05/19 3:07 | 02/05/19 3:10 |
+-------+----------+---------------+---------------+

The output that I am trying to get is as below:

+-------+----------+---------------+
| tv_id | movie_id |    minute     |
+-------+----------+---------------+
| tv123 | movie123 | 02/05/19 3:05 |
| tv123 | movie123 | 02/05/19 3:06 |
| tv123 | movie123 | 02/05/19 3:07 |
| tv234 | movie345 | 02/05/19 3:07 |
| tv234 | movie345 | 02/05/19 3:08 |
| tv234 | movie345 | 02/05/19 3:09 |
+-------+----------+---------------+

Detailed explanation: for tv_id tv123, the total watch time is 3 minutes (3:08 - 3:05), so there is one output row for each of 3:05, 3:06, and 3:07; the end time itself is not included. The same goes for the other records.

I am trying to get this result with Python, Scala, or SQL (there is no restriction on the language). My Python code:

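import pandas as pd

df = pd.read_csv('data', parse_dates=['start_time', 'end_time'])
# Watch time per row (end minus start)
df['minutes_diff'] = df['end_time'] - df['start_time']

# Intended: one output row per minute offset i; this loop is where I am stuck
for i in range(df['minutes_diff']):
    finaldf = df['tv_id'] + df['movie_id'] + df['start_time'] + df['minutes_diff'] + i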

I am not sure how to go about it. I am not well versed in Scala's flatMap. Some research on Stack Overflow pointed me to flatMap, but I am not sure how to use the time difference inside flatMap in place of an aggregation.

Note: I don't want to open separate threads for SQL and Python, so I am combining them in this one question. Even a SQL solution would be perfectly fine for me.

Answers (1)
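
A minimal sketch of the per-row expansion that a flatMap would apply, written in plain Python so it runs without a cluster. The function name `expand_minutes`, the constant `TS_FORMAT`, and the sample `rows` list are illustrative and not from the question. The timestamp format (month/day/two-digit year, 24-hour clock) is an assumption based on the sample data. The end time is excluded, matching the expected output.

```python
from datetime import datetime, timedelta

# Assumption: timestamps look like "02/05/19 3:05"
# (month/day/two-digit year, 24-hour clock).
TS_FORMAT = "%m/%d/%y %H:%M"


def expand_minutes(row):
    """Yield one (tv_id, movie_id, minute) tuple per minute in [start_time, end_time)."""
    tv_id, movie_id, start_time, end_time = row
    current = datetime.strptime(start_time, TS_FORMAT)
    end = datetime.strptime(end_time, TS_FORMAT)
    while current < end:  # end_time itself is excluded
        yield (tv_id, movie_id, current)
        current += timedelta(minutes=1)


# Sample rows copied from the question's input table.
rows = [
    ("tv123", "movie123", "02/05/19 3:05", "02/05/19 3:08"),
    ("tv234", "movie345", "02/05/19 3:07", "02/05/19 3:10"),
]

# Plain-Python equivalent of flatMap: concatenate the per-row results.
expanded = [out for row in rows for out in expand_minutes(row)]

for tv_id, movie_id, minute in expanded:
    print(tv_id, movie_id, minute.strftime(TS_FORMAT))
```

In PySpark, the same function could probably be passed as `df.rdd.flatMap(expand_minutes)`, assuming the DataFrame has exactly these four columns in this order and the two time columns are strings. On Spark 2.4 or later, the built-in SQL functions `sequence` and `explode` are another option that avoids a Python function; note that `sequence` includes its stop value, so the end time would need to be moved back by one minute to get the same rows.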
