Reputation: 11
I have been searching everywhere but cannot seem to find a solution. My problem is this: I have a transcript file of a video. In the file, timestamp lines alternate with caption lines. I want to get rid of the timestamps, which follow the pattern "\0:xx:xx\" where each x is a digit. I open the file and replace each newline with a space so the entire file is one string, thinking that might be simpler.
So, txt = open("transcriptRaw.txt").read().replace('\n', '')
and the output of print(txt) is:
{\rtf1\ansi\ansicpg1252\cocoartf2511\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fswiss\fcharset0 Helvetica;}{\colortbl;\red255\green255\blue255;}{*\expandedcolortbl;;}\margl1440\margr1440\vieww10800\viewh8400\viewkind0\deftab560\pard\pardeftab560\ri0\partightenfactor0\f0\fs24 \cf0 Say that this person is to blame\0:20:48\and the others that were involved.\0:20:51\And since we're not to blame 155 in fact,\0:20:54\innocent punishment is a way to\0:20:56\express this distribution of lame\0:20:59\in certain situation as such.\0:21:01\I punishment has,
My goal is to get rid of that weird header (it is not present in the original txt file, and I don't know why it is there) and to replace every timestamp "\0:xx:xx" with a space. Currently there is a space after the last backslash, but originally it was a newline.
Most of the legible Google search results on regex are about finding matches in multiple short strings, so all I could gather is that for a backslash I should write "\\" instead of "\" or so. Hence, I have tried: re.sub("\b(\\0)\B.*?(\\)+",' ', txt)
Theoretically, this expression finds substrings in txt that start with the literal "\0" and end with the literal "\", and replaces them with a space. However, I get an error, and the traceback only points into the built-in functions, so it is not helpful at all.
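One plausible reason for the error, assuming the pattern was typed exactly as shown above: in a normal (non-raw) Python string, "\b" becomes a backspace character and "\\)" reaches the regex engine as an escaped parenthesis, which leaves the second group unclosed. The sketch below only inspects the pattern string and tries to compile it; the variable names are illustrative.

```python
import re

# Same characters Python builds from the non-raw literal in the attempt above
# ("\B" and "\\B" yield the same two characters; the doubled form avoids a warning).
pattern = "\b(\\0)\\B.*?(\\)+"

# Show what the regex engine actually receives: the first character is a backspace.
print(repr(pattern))

try:
    re.compile(pattern)
    compiled = True
except re.error as exc:
    # The final "(" is never closed, because "\)" is a literal parenthesis.
    compiled = False
    print("re.error:", exc)
```

Writing the pattern as a raw string (r'...') avoids this double layer of escaping.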
My question is: should I leave the newlines in place and use regex in "multiline" mode? I'm not sure if that would be better. More importantly, how do I actually solve this seemingly trivial problem, which has literally cost me a full day of head-scratching?
Thanks!
Upvotes: 1
Views: 89
Reputation: 18631
Use
re.sub(r'\\0[\d:]*\\+', ' ', txt)
See regex proof
Explanation
--------------------------------------------------------------------------------
  \\                       '\'
--------------------------------------------------------------------------------
  0                        '0'
--------------------------------------------------------------------------------
  [\d:]*                   any character of: digits (0-9), ':' (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \\+                      '\' (1 or more times (matching the most
                           amount possible))
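As a quick check, here is a minimal sketch that applies the pattern to a shortened sample in the same shape as the text in the question. The sample string and the names TIMESTAMP and cleaned are made up for illustration, and the RTF header is left out.

```python
import re

# Shortened sample shaped like the question's text; "\\" in a normal string
# literal is one literal backslash, so this holds ...blame\0:20:48\and...
txt = (
    "Say that this person is to blame\\0:20:48\\"
    "and the others that were involved.\\0:20:51\\"
    "And since we're not to blame"
)

# Raw string: each \\ in the pattern matches one literal backslash in txt.
TIMESTAMP = re.compile(r'\\0[\d:]*\\+')

# Replace every timestamp (including its surrounding backslashes) with a space.
cleaned = TIMESTAMP.sub(' ', txt)
print(cleaned)
```

Because the pattern does not depend on line boundaries, multiline mode should not be needed for this replacement.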
Upvotes: 1