Reputation: 13
I'm just starting to learn Python and have run into a problem.
I have an SRT document (subtitles) named sub. It looks like this:
8
00:01:03,090 --> 00:01:05,260
<b><font color="#008080">MATER:</font></b> Yes, sir, you did.
<b><font color="#808000">(MCQUEEN GASPS)</font></b>
9
00:01:05,290 --> 00:01:07,230
You used to say
that all the time.
In Python it looks like this:
'3', '00:00:46,570 --> 00:00:48,670', '<b><font color="#008080">MCQUEEN:</font></b> Okay, here we go.', '', '4', '00:00:48,710 --> 00:00:52,280', 'Focus. Speed. I am speed.', '', '5', '00:00:52,310 --> 00:00:54,250', '<b><font color="#808000">(ENGINES ROARING)</font></b>', '',
I also have a list of words (named noun). It looks like this:
['man', 'poster', 'motivation', 'son' ... 'boy']
Let's look at this example:
...'4', '00:00:48,710 --> 00:00:52,280', 'Focus. Speed. I am speed.', '', '5',....
What I need to do is find each word from the list in the subtitles (its first appearance; as an illustration, "Speed") and store in a list the time at which the word appears (00:00:48,710 --> 00:00:52,280) together with the sequence number (4), which comes just before the time in the document. I tried to get this information using index, but unfortunately I did not succeed.
Can you help me with this?
Upvotes: 1
Views: 94
Reputation: 336
Continuing with Anton vBR's suggestion:
words = ['ingonyama', 'king']
results = []
for w in words:
    for row in df.itertuples():
        if row[2] is not None:
            if w in row[2].lower():
                results.append((w, row[0], row[1]))
        if row[3] is not None:
            if w in row[3].lower():
                results.append((w, row[0], row[1]))
print(results)
You'll get a list of tuples, each of which contains a word you're searching for, a sequence number where it appears, and a time-frame where it appears. Then you can write these tuples to a csv file or whatever. Hope this helps.
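If you'd rather stay with the flat list of lines shown in the question, here is a minimal sketch that keeps only the first appearance of each word. The function name first_occurrences is hypothetical, and the sketch assumes that cues are separated by empty strings and that each cue starts with its sequence number followed by the time range. It strips the HTML-like tags and matches whole words, so 'son' would not match inside 'person'.

```python
import re

def first_occurrences(lines, words):
    """Map each word to (sequence number, time range) of the first cue containing it."""
    found = {}
    block = []
    # The trailing '' is a sentinel so the last cue is flushed as well.
    for line in lines + ['']:
        if line.strip():
            block.append(line)
            continue
        if len(block) >= 3:
            seq, time_range, text_lines = block[0], block[1], block[2:]
            # Drop tags such as <b> or <font ...>, then split into lowercase words.
            text = re.sub(r'<[^>]+>', '', ' '.join(text_lines)).lower()
            tokens = set(re.findall(r"[a-z']+", text))
            for w in words:
                if w not in found and w.lower() in tokens:
                    found[w] = (seq, time_range)
        block = []
    return found

# Sample data copied from the question.
lines = [
    '3', '00:00:46,570 --> 00:00:48,670',
    '<b><font color="#008080">MCQUEEN:</font></b> Okay, here we go.', '',
    '4', '00:00:48,710 --> 00:00:52,280', 'Focus. Speed. I am speed.', '',
    '5', '00:00:52,310 --> 00:00:54,250',
    '<b><font color="#808000">(ENGINES ROARING)</font></b>', '',
]
found = first_occurrences(lines, ['speed', 'engines', 'king'])
print(found)
```

Words that never appear (here 'king') are simply left out of the result.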
Upvotes: 0
Reputation: 18916
Welcome to SO and Python. Although this is not a complete answer, I think it might be helpful. The most widely used Python library for tabular data is pandas. You can read the srt file into a DataFrame and work your way from there. (You would need to learn the pandas syntax to do things, but it is time well invested.)
import pandas as pd
import requests
# Lion King subtitle
data = requests.get("https://opensubtitles.co/download/67071").text
df = pd.DataFrame([i.split("\r\n") for i in data.split("\r\n\r\n")])
df = df.rename(columns={0:"Index",1:"Time",2:"Row1",3:"Row2"}).set_index("Index")
Printing the first 5 rows with print(df.head())
gives:
                                Time                          Row1  Row2
Index
1      00:01:01,600 --> 00:01:05,800        <i>Nants ingonyama</i>  None
2      00:01:05,900 --> 00:01:07,200           <i>Bagithi baba</i>  None
3      00:01:07,300 --> 00:01:10,600  <i>Sithi uhhmm ingonyama</i>  None
4      00:01:10,700 --> 00:01:13,300              <i>lngonyama</i>  None
5      00:01:13,300 --> 00:01:16,400        <i>Nants ingonyama</i>  None
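The split on "\r\n\r\n" above assumes Windows line endings, and it spreads multi-line cues over several columns. As a rough sketch under those caveats, the hypothetical helper parse_srt below normalizes line endings and joins all text lines of a cue into one field. The sample string is adapted from the question, with the tags left out and a blank line between cues as the SRT format expects. The resulting list of dicts can be passed to pd.DataFrame if you want the table form.

```python
import re

def parse_srt(text):
    """Split raw SRT text into a list of cue dicts with Index, Time and Text keys."""
    # Accept both Windows (\r\n) and Unix (\n) line endings.
    text = text.replace('\r\n', '\n').strip()
    cues = []
    for block in re.split(r'\n\s*\n', text):
        parts = block.split('\n')
        if len(parts) < 3:
            continue  # skip blocks without index, time and at least one text line
        cues.append({
            'Index': parts[0].strip(),
            'Time': parts[1].strip(),
            'Text': ' '.join(p.strip() for p in parts[2:]),
        })
    return cues

# Illustrative sample based on the subtitles in the question.
sample = (
    "8\r\n00:01:03,090 --> 00:01:05,260\r\nMATER: Yes, sir, you did.\r\n\r\n"
    "9\r\n00:01:05,290 --> 00:01:07,230\r\nYou used to say\r\nthat all the time.\r\n"
)
cues = parse_srt(sample)
for cue in cues:
    print(cue['Index'], cue['Time'], cue['Text'])
```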
Upvotes: 1