MareikeP

Reputation: 186

How to extract several timestamp pairs from a list in Python

I have extracted all timestamps from a transcript file. The output looks like this:

('[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, 00:00:09,180, '
 '00:00:10,830, 00:00:10,830, 00:00:14,070, 00:00:14,070, 00:00:16,890, '
 '00:00:16,890, 00:00:19,080, 00:00:19,080, 00:00:21,590, 00:00:21,590, '
 '00:00:24,030, 00:00:24,030, 00:00:26,910, 00:00:26,910, 00:00:29,640, '
 '00:00:29,640, 00:00:31,920, 00:00:31,920, 00:00:35,850, 00:00:35,850, '
 '00:00:38,629, 00:00:38,629, 00:00:40,859, 00:00:40,859, 00:00:43,170, '
 '00:00:43,170, 00:00:45,570, 00:00:45,570, 00:00:48,859, 00:00:48,859, '
 '00:00:52,019, 00:00:52,019, 00:00:54,449, 00:00:54,449, 00:00:57,210, '
 '00:00:57,210, 00:00:59,519, 00:00:59,519, 00:01:02,690, 00:01:02,690, '
 '00:01:05,820, 00:01:05,820, 00:01:08,549, 00:01:08,549, 00:01:10,490, '
 '00:01:10,490, 00:01:13,409, 00:01:13,409, 00:01:16,409, 00:01:16,409, '
 '00:01:18,149, 00:01:18,149, 00:01:20,340, 00:01:20,340, 00:01:22,649, '
 '00:01:22,649, 00:01:26,159, 00:01:26,159, 00:01:28,740, 00:01:28,740, '
 '00:01:30,810, 00:01:30,810, 00:01:33,719, 00:01:33,719, 00:01:36,990, '
 '00:01:36,990, 00:01:39,119, 00:01:39,119, 00:01:41,759, 00:01:41,759, '
 '00:01:43,799, 00:01:43,799, 00:01:46,619, 00:01:46,619, 00:01:49,140, '
 '00:01:49,140, 00:01:51,240, 00:01:51,240, 00:01:53,759, 00:01:53,759, '
 '00:01:56,460, 00:01:56,460, 00:01:58,740, 00:01:58,740, 00:02:01,640, '
 '00:02:01,640, 00:02:04,409, 00:02:04,409, 00:02:07,229, 00:02:07,229, '
 '00:02:09,380, 00:02:09,380, 00:02:12,060, 00:02:12,060, 00:02:14,840, ]')

In this output, there are always timestamp pairs, i.e. always 2 consecutive timestamps belong together, for example: 00:00:03,950 and 00:00:06,840, 00:00:06,840 and 00:00:09,180, etc.

Now, I want to extract all these timestamp pairs separately so that the output looks like this:

00:00:03,950 - 00:00:06,840

00:00:06,840 - 00:00:09,180

00:00:09,180 - 00:00:10,830

etc.

For now, I have the following (very inconvenient) solution for my problem:

from textwrap import dedent

# get the first timestamp of the first pair (slice includes a leading space)
a = res_timestamps[2:15]
print(dedent(a))

# get the second timestamp of the first pair
b = res_timestamps[17:29]
print(b)

# combine both timestamps into one pair
c = a + ' - ' + b
print(dedent(c))

Of course, this does not scale, since I cannot work out the indices by hand for every transcript. Using a loop has not worked so far because the output is a single string, so iterating over it yields one character at a time rather than one timestamp.

Is there an elegant solution for my problem?

I appreciate any help or tip.

Thank you very much in advance!

Upvotes: 1

Views: 143

Answers (2)

Trenton McKinney

Reputation: 62413

  • Here's a solution without regular expressions
  • Clean the string, and split on ', ' to create a list
  • Use string slicing to select the odd and even values and zip them together.
# assume data holds your string

# convert data into a list by removing end brackets and spaces, and splitting
data = data.replace('[, ', '').replace(', ]', '').split(', ')

# use list slicing and zip the two components
combinations = list(zip(data[::2], data[1::2]))

# print the first 5
print(combinations[:5])
[out]:

[('00:00:03,950', '00:00:06,840'),
 ('00:00:06,840', '00:00:09,180'),
 ('00:00:09,180', '00:00:10,830'),
 ('00:00:10,830', '00:00:14,070'),
 ('00:00:14,070', '00:00:16,890')]
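
To get from these tuples to the `start - end` lines the question asks for, a minimal sketch could look like the following. The shortened `data` string is an assumption that mimics the shape of the question's output, and the names `items`, `pairs`, and `lines` are only illustrative.

```python
# Shortened sample in the same shape as the question's string (assumption).
data = '[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, ]'

# Same cleaning and pairing steps as above.
items = data.replace('[, ', '').replace(', ]', '').split(', ')
pairs = list(zip(items[::2], items[1::2]))

# Join each pair with ' - ' to build one output line per pair.
lines = [f"{start} - {end}" for start, end in pairs]
print("\n".join(lines))
```

Note that `zip` stops at the shorter input, so a trailing timestamp without a partner would be dropped silently.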

Upvotes: 1

Julia

Reputation: 2085

Regex to the rescue!

A solution that works perfectly on your example data:

import re
from pprint import pprint

# your_data is the string containing all timestamps
timestamps = re.findall(r"(\d{2}:\d{2}:\d{2},\d{3}), (\d{2}:\d{2}:\d{2},\d{3})", your_data)
pprint(timestamps)

This prints:

[('00:00:03,950', '00:00:06,840'),
 ('00:00:06,840', '00:00:09,180'),
 ('00:00:09,180', '00:00:10,830'),
 ('00:00:10,830', '00:00:14,070'),
 ('00:00:14,070', '00:00:16,890'),
 ('00:00:16,890', '00:00:19,080'),
 ('00:00:19,080', '00:00:21,590'),
 ('00:00:21,590', '00:00:24,030'),
 ('00:00:24,030', '00:00:26,910'),
 ('00:00:26,910', '00:00:29,640'),
 ('00:00:29,640', '00:00:31,920'),
 ('00:00:31,920', '00:00:35,850'),
 ('00:00:35,850', '00:00:38,629'),
 ('00:00:38,629', '00:00:40,859'),
 ('00:00:40,859', '00:00:43,170'),
 ('00:00:43,170', '00:00:45,570'),
 ('00:00:45,570', '00:00:48,859'),
 ('00:00:48,859', '00:00:52,019'),
 ('00:00:52,019', '00:00:54,449'),
 ('00:00:54,449', '00:00:57,210'),
 ('00:00:57,210', '00:00:59,519'),
 ('00:00:59,519', '00:01:02,690'),
 ('00:01:02,690', '00:01:05,820'),
 ('00:01:05,820', '00:01:08,549'),
 ('00:01:08,549', '00:01:10,490'),
 ('00:01:10,490', '00:01:13,409'),
 ('00:01:13,409', '00:01:16,409'),
 ('00:01:16,409', '00:01:18,149'),
 ('00:01:18,149', '00:01:20,340'),
 ('00:01:20,340', '00:01:22,649'),
 ('00:01:22,649', '00:01:26,159'),
 ('00:01:26,159', '00:01:28,740'),
 ('00:01:28,740', '00:01:30,810'),
 ('00:01:30,810', '00:01:33,719'),
 ('00:01:33,719', '00:01:36,990'),
 ('00:01:36,990', '00:01:39,119'),
 ('00:01:39,119', '00:01:41,759'),
 ('00:01:41,759', '00:01:43,799'),
 ('00:01:43,799', '00:01:46,619'),
 ('00:01:46,619', '00:01:49,140'),
 ('00:01:49,140', '00:01:51,240'),
 ('00:01:51,240', '00:01:53,759'),
 ('00:01:53,759', '00:01:56,460'),
 ('00:01:56,460', '00:01:58,740'),
 ('00:01:58,740', '00:02:01,640'),
 ('00:02:01,640', '00:02:04,409'),
 ('00:02:04,409', '00:02:07,229'),
 ('00:02:07,229', '00:02:09,380'),
 ('00:02:09,380', '00:02:12,060'),
 ('00:02:12,060', '00:02:14,840')]

You could output this in your desired format like so:

for start, end in timestamps:
    print(f"{start} - {end}")
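
If the separator between timestamps might vary, one possible variation is to match each timestamp on its own and pair them afterwards. This is only a sketch: `sample`, `TIMESTAMP`, and `pair_timestamps` are hypothetical names, and raising on an odd count is an assumption about how you would want unpaired data handled.

```python
import re

# Shortened sample in the same shape as the question's string (assumption).
sample = '[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, ]'

# Matches a single HH:MM:SS,mmm timestamp.
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}")


def pair_timestamps(text):
    """Return (start, end) tuples built from consecutive timestamps in text."""
    stamps = TIMESTAMP.findall(text)
    # An odd count means one timestamp has no partner; fail loudly (assumption).
    if len(stamps) % 2:
        raise ValueError(f"expected an even number of timestamps, got {len(stamps)}")
    return list(zip(stamps[::2], stamps[1::2]))


for start, end in pair_timestamps(sample):
    print(f"{start} - {end}")
```

Unlike the two-group pattern above, this does not depend on the exact `', '` separator, at the cost of a separate pairing step.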

Upvotes: 1
