Reputation: 186
I have extracted all timestamps from a transcript file. The output looks like this:
('[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, 00:00:09,180, '
'00:00:10,830, 00:00:10,830, 00:00:14,070, 00:00:14,070, 00:00:16,890, '
'00:00:16,890, 00:00:19,080, 00:00:19,080, 00:00:21,590, 00:00:21,590, '
'00:00:24,030, 00:00:24,030, 00:00:26,910, 00:00:26,910, 00:00:29,640, '
'00:00:29,640, 00:00:31,920, 00:00:31,920, 00:00:35,850, 00:00:35,850, '
'00:00:38,629, 00:00:38,629, 00:00:40,859, 00:00:40,859, 00:00:43,170, '
'00:00:43,170, 00:00:45,570, 00:00:45,570, 00:00:48,859, 00:00:48,859, '
'00:00:52,019, 00:00:52,019, 00:00:54,449, 00:00:54,449, 00:00:57,210, '
'00:00:57,210, 00:00:59,519, 00:00:59,519, 00:01:02,690, 00:01:02,690, '
'00:01:05,820, 00:01:05,820, 00:01:08,549, 00:01:08,549, 00:01:10,490, '
'00:01:10,490, 00:01:13,409, 00:01:13,409, 00:01:16,409, 00:01:16,409, '
'00:01:18,149, 00:01:18,149, 00:01:20,340, 00:01:20,340, 00:01:22,649, '
'00:01:22,649, 00:01:26,159, 00:01:26,159, 00:01:28,740, 00:01:28,740, '
'00:01:30,810, 00:01:30,810, 00:01:33,719, 00:01:33,719, 00:01:36,990, '
'00:01:36,990, 00:01:39,119, 00:01:39,119, 00:01:41,759, 00:01:41,759, '
'00:01:43,799, 00:01:43,799, 00:01:46,619, 00:01:46,619, 00:01:49,140, '
'00:01:49,140, 00:01:51,240, 00:01:51,240, 00:01:53,759, 00:01:53,759, '
'00:01:56,460, 00:01:56,460, 00:01:58,740, 00:01:58,740, 00:02:01,640, '
'00:02:01,640, 00:02:04,409, 00:02:04,409, 00:02:07,229, 00:02:07,229, '
'00:02:09,380, 00:02:09,380, 00:02:12,060, 00:02:12,060, 00:02:14,840, ]')
In this output, the timestamps always come in pairs, i.e. every 2 consecutive timestamps belong together, for example: 00:00:03,950 and 00:00:06,840, 00:00:06,840 and 00:00:09,180, etc.
Now, I want to extract all these timestamp pairs separately so that the output looks like this:
00:00:03,950 - 00:00:06,840
00:00:06,840 - 00:00:09,180
00:00:09,180 - 00:00:10,830
etc.
For now, I have the following (very inconvenient) solution for my problem:
from textwrap import dedent
# get first part of first timestamp
a = res_timestamps[2:15]
print(dedent(a))
# get second part of first timestamp
b = res_timestamps[17:29]
print(b)
# combine timestamp parts
c = a + ' - ' + b
print(dedent(c))
Of course, this is very bad since I cannot work out the indices manually for every transcript. Trying to use a loop has not worked yet because iterating over the string yields single characters, not timestamps.
Is there an elegant solution for my problem?
I appreciate any help or tip.
Thank you very much in advance!
Upvotes: 1
Views: 143
Reputation: 62413
Remove the brackets and split on ', ' to create a list.
# given data as your string
# convert data into a list by removing end brackets and spaces, and splitting
data = data.replace('[, ', '').replace(', ]', '').split(', ')
# use list slicing and zip the two components
combinations = list(zip(data[::2], data[1::2]))
# print the first 5
print(combinations[:5])
[out]:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890')]
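To get the exact "start - end" lines asked for in the question, the tuples can be joined afterwards. Here is a minimal sketch, assuming the input is a single string shaped like the one in the question. The name pair_timestamps and the shortened sample string are made up for illustration:

```python
def pair_timestamps(data):
    """Return 'start - end' strings from a '[, t1, t2, ..., ]' style string."""
    # strip the bracket debris, then split into individual timestamps
    items = data.replace('[, ', '').replace(', ]', '').split(', ')
    # even indices are starts, odd indices are the matching ends
    return [f"{start} - {end}" for start, end in zip(items[::2], items[1::2])]


# shortened sample in the same shape as the question's output (assumption)
sample = '[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, ]'
for line in pair_timestamps(sample):
    print(line)
```

Note that zip stops at the shorter input, so a trailing timestamp without a partner would be dropped silently. If that can happen in your transcripts, check that len(items) is even first.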
Upvotes: 1
Reputation: 2085
Regex to the rescue!
A solution that works perfectly on your example data:
import re
from pprint import pprint
# your_data is the string that holds all the timestamps
timestamps = re.findall(r"(\d{2}:\d{2}:\d{2},\d{3}), (\d{2}:\d{2}:\d{2},\d{3})", your_data)
pprint(timestamps)
This prints:
[('00:00:03,950', '00:00:06,840'),
('00:00:06,840', '00:00:09,180'),
('00:00:09,180', '00:00:10,830'),
('00:00:10,830', '00:00:14,070'),
('00:00:14,070', '00:00:16,890'),
('00:00:16,890', '00:00:19,080'),
('00:00:19,080', '00:00:21,590'),
('00:00:21,590', '00:00:24,030'),
('00:00:24,030', '00:00:26,910'),
('00:00:26,910', '00:00:29,640'),
('00:00:29,640', '00:00:31,920'),
('00:00:31,920', '00:00:35,850'),
('00:00:35,850', '00:00:38,629'),
('00:00:38,629', '00:00:40,859'),
('00:00:40,859', '00:00:43,170'),
('00:00:43,170', '00:00:45,570'),
('00:00:45,570', '00:00:48,859'),
('00:00:48,859', '00:00:52,019'),
('00:00:52,019', '00:00:54,449'),
('00:00:54,449', '00:00:57,210'),
('00:00:57,210', '00:00:59,519'),
('00:00:59,519', '00:01:02,690'),
('00:01:02,690', '00:01:05,820'),
('00:01:05,820', '00:01:08,549'),
('00:01:08,549', '00:01:10,490'),
('00:01:10,490', '00:01:13,409'),
('00:01:13,409', '00:01:16,409'),
('00:01:16,409', '00:01:18,149'),
('00:01:18,149', '00:01:20,340'),
('00:01:20,340', '00:01:22,649'),
('00:01:22,649', '00:01:26,159'),
('00:01:26,159', '00:01:28,740'),
('00:01:28,740', '00:01:30,810'),
('00:01:30,810', '00:01:33,719'),
('00:01:33,719', '00:01:36,990'),
('00:01:36,990', '00:01:39,119'),
('00:01:39,119', '00:01:41,759'),
('00:01:41,759', '00:01:43,799'),
('00:01:43,799', '00:01:46,619'),
('00:01:46,619', '00:01:49,140'),
('00:01:49,140', '00:01:51,240'),
('00:01:51,240', '00:01:53,759'),
('00:01:53,759', '00:01:56,460'),
('00:01:56,460', '00:01:58,740'),
('00:01:58,740', '00:02:01,640'),
('00:02:01,640', '00:02:04,409'),
('00:02:04,409', '00:02:07,229'),
('00:02:07,229', '00:02:09,380'),
('00:02:09,380', '00:02:12,060'),
('00:02:12,060', '00:02:14,840')]
You could output this in your desired format like so:
for start, end in timestamps:
    print(f"{start} - {end}")
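Putting both steps together, here is a self-contained sketch of the same regex idea. The names TS, PAIR and format_pairs and the shortened sample string are only illustrative, and it assumes every timestamp has the HH:MM:SS,mmm shape shown in the question:

```python
import re

# one timestamp such as 00:00:03,950
TS = r"\d{2}:\d{2}:\d{2},\d{3}"
# two timestamps separated by ", ", each captured as its own group
PAIR = re.compile(rf"({TS}), ({TS})")


def format_pairs(text):
    """Return 'start - end' lines for each consecutive timestamp pair."""
    # findall returns non-overlapping matches, so each timestamp in the
    # string is consumed by exactly one pair
    return [f"{start} - {end}" for start, end in PAIR.findall(text)]


# shortened sample in the same shape as the question's output (assumption)
sample = "[, 00:00:03,950, 00:00:06,840, 00:00:06,840, 00:00:09,180, ]"
print("\n".join(format_pairs(sample)))
```

Because the pattern matches the timestamps themselves, the leading "[, " and trailing ", ]" do not need to be stripped beforehand.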
Upvotes: 1