Reputation: 11
I have two text files: Speech.txt and Script.txt. Speech.txt contains a list of filenames of audio files and Script.txt contains the relevant transcripts. Script.txt contains transcripts for all characters and items, but I only want the transcripts for one specific character. I want to write a Python script that matches each filename against the transcript and writes a text file containing the file path, filename, extension and transcript, separated by |.
Sample of Speech.txt:
0x000f4a03.wav
0x000f4a07.wav
0x000f4a0f.wav
Sample of Script.txt:
0x000f4a0f | | And unites the clans against Nilfgaard?
0x000f4a11 | | Of course. He's already decreed new longships be built.
0x000f4a03 | | Thinking long-term, then. Think she'll succeed?
0x000f4a05 | | She's got a powerful ally. In me.
0x000f4a07 | | Son's King of Skellige. Congratulations to you.
Expected Output:
C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?
Code (work in progress):
f1=open(r'C:/Speech.txt',"r", encoding='utf8')
f2=open(r'C:/script.txt',"r", encoding='utf8')
for line1 in f1:
    for line2 in f2:
        if line1[0:10]==line2[0:10]:
            print('C:/Speech/' + line2[0:10] + '.wav' + '|' + line2[26:-1])
f1.close()
f2.close()
The above code seems to work only for the first line in Speech.txt and then stops. I want it to run through the entire file, i.e. line 2, line 3, etc. I also haven't figured out how to write the results to a text file; I can only print them at the moment. Any help would be appreciated!
EDIT Links to Script.txt and Speech.txt.
Upvotes: 1
Views: 1360
Reputation: 1888
For each line of the Speech.txt file, you need to check whether it exists in the Script.txt file. Assuming the content of Script.txt fits in memory, you should load it once into a dictionary to avoid re-reading the file every time.
Once the content of Script.txt is loaded, you simply process each line of Speech.txt, look it up in the dictionary and print it when it is found.
The code follows. Notice that:
- The debug messages are guarded by __debug__; run the script with python -O script.py to disable them.
- It uses os.path.splitext(var)[0] to remove the extension from the filename.
- It strips every processed line to get rid of spaces and line breaks.
Code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# For better print formatting
from __future__ import print_function
# Imports
import sys
import os
#
# HELPER METHODS
#
def load_script_file(script_file_path):
    # Parse each line of the script file and load to a dictionary
    d = {}
    with open(script_file_path, "r") as f:
        for transcript_info in f:
            if __debug__:
                print("Loading line: " + str(transcript_info))
            speech_filename, _, transcription = transcript_info.split("|")
            speech_filename = speech_filename.strip()
            transcription = transcription.strip()
            d[speech_filename] = transcription
    if __debug__:
        print("Loaded values: " + str(d))
    return d
#
# MAIN METHODS
#
def main(speech_file_path, script_file_path, output_file):
    # Load the script data into a dictionary
    speech_to_transcript = load_script_file(script_file_path)
    # Check each speech entry
    with open(speech_file_path, "r") as f:
        for speech_audio_file in f:
            speech_audio_file = speech_audio_file.strip()
            if __debug__:
                print()
                print("Checking speech file: " + str(speech_audio_file))
            # Remove extension
            speech_code = os.path.splitext(speech_audio_file)[0]
            if __debug__:
                print(" + Obtained filename: " + speech_code)
            # Find entry in transcript
            if speech_code in speech_to_transcript.keys():
                if __debug__:
                    print(" + Filename registered. Loading transcript")
                transcript = speech_to_transcript[speech_code]
                if __debug__:
                    print(" + Transcript: " + str(transcript))
                # Print information
                output_line = "C:/Speech/" + speech_audio_file + "|" + transcript
                if output_file is None:
                    print(output_line)
                else:
                    with open(output_file, 'a') as fw:
                        fw.write(output_line + "\n")
            else:
                if __debug__:
                    print(" + Filename not registered")
#
# ENTRY POINT
#
if __name__ == '__main__':
    # Parse arguments
    args = sys.argv[1:]
    speech = str(args[0])
    script = str(args[1])
    if len(args) == 3:
        output = str(args[2])
    else:
        output = None
    # Log arguments if required
    if __debug__:
        print("Running with:")
        print(" - SPEECH FILE = " + str(speech))
        print(" - SCRIPT FILE = " + str(script))
        print(" - OUTPUT FILE = " + str(output))
        print()
    # Execute main
    main(speech, script, output)
Debug Output:
$ python speech_transcript.py ./Speech.txt ./Script.txt
Running with:
- SPEECH FILE = ./Speech.txt
- SCRIPT FILE = ./Script.txt
Loaded values: {'0x000f4a03': "Thinking long-term, then. Think she'll succeed?", '0x000f4a11': "Of course. He's already decreed new longships be built.", '0x000f4a05': "She's got a powerful ally. In me.", '0x000f4a07': "Son's King of Skellige. Congratulations to you.", '0x000f4a0f': 'And unites the clans against Nilfgaard?'}
Checking speech file: 0x000f4a03.wav
+ Obtained filename: 0x000f4a03
+ Filename registered. Loading transcript
+ Transcript: Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
Checking speech file: 0x000f4a07.wav
+ Obtained filename: 0x000f4a07
+ Filename registered. Loading transcript
+ Transcript: Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
Checking speech file: 0x000f4a0f.wav
+ Obtained filename: 0x000f4a0f
+ Filename registered. Loading transcript
+ Transcript: And unites the clans against Nilfgaard?
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?
Output:
$ python -O speech_transcript.py ./Speech.txt ./Script.txt
C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?
Output writing to file:
$ python -O speech_transcript.py ./Speech.txt ./Script.txt ./output.txt
$ more output.txt
C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?
Upvotes: 1
Reputation: 26325
I would read the Script.txt contents into a dictionary, then use this dictionary as you iterate over the lines from Speech.txt, and only print lines that exist. This avoids the need to iterate over the file multiple times, which could be quite slow if you have large files.
Demo:
from pathlib import Path
with open("Speech.txt") as speech_file, open("Script.txt") as script_file:
    script_dict = {}
    for line in script_file:
        key, _, text = map(str.strip, line.split("|"))
        script_dict[key] = text
    for line in map(str.strip, speech_file):
        filename = Path(line).stem
        if filename in script_dict:
            # Backslashes are doubled so they are not treated as escape sequences
            print(f"C:\\Speech\\{line}|{script_dict[filename]}")
Output:
C:\Speech\0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:\Speech\0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:\Speech\0x000f4a0f.wav|And unites the clans against Nilfgaard?
It's also much easier to use With Statement Context Managers to open your files, since you don't need to call .close() to close your file; the context manager handles that for you.
I also used pathlib.PurePath.stem to get the filename from your .wav files. I find this easier to use than the os.path.basename and os.path.splitext functions, although this is personal preference and all will work.
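To make that comparison concrete, here is a small sketch showing that both approaches give the same key for the sample filenames. The helper names stem_with_pathlib and stem_with_os_path are made up for illustration and are not part of either library.

```python
import os.path
from pathlib import PurePath


def stem_with_pathlib(filename):
    # PurePath.stem drops any leading directories and the final suffix
    return PurePath(filename).stem


def stem_with_os_path(filename):
    # basename drops the directories, splitext drops the final extension
    return os.path.splitext(os.path.basename(filename))[0]


for name in ["0x000f4a03.wav", "C:/Speech/0x000f4a07.wav"]:
    print(stem_with_pathlib(name), stem_with_os_path(name))
```

Either result can be used directly as the lookup key for the dictionary built from Script.txt.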
If we want to write the output to a text file, we can open another output file in write mode using mode="w":
from pathlib import Path
with open("Speech.txt") as speech_file, open("Script.txt") as script_file, open("output.txt", mode="w") as output_file:
    script_dict = {}
    for line in script_file:
        key, _, text = map(str.strip, line.split("|"))
        script_dict[key] = text
    for line in map(str.strip, speech_file):
        filename = Path(line).stem
        if filename in script_dict:
            # Backslashes are doubled so they are not treated as escape sequences
            output_file.write(f"C:\\Speech\\{line}|{script_dict[filename]}\n")
output.txt
C:\Speech\0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:\Speech\0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:\Speech\0x000f4a0f.wav|And unites the clans against Nilfgaard?
You can have a look at Reading and Writing Files in the documentation for more information on how to read and write files in Python.
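One caveat with unpacking line.split("|") into three names: it raises ValueError on a blank line, or on a transcript that itself contains a | character. Below is a hedged sketch of a more tolerant parser, assuming the three-column layout shown in the sample. parse_script_line is a hypothetical helper name, and the third sample line is invented to show the extra-pipe case.

```python
def parse_script_line(line):
    """Return (key, text) for a 'key | ... | text' line, or None if malformed."""
    # maxsplit=2 keeps any further '|' characters inside the transcript
    parts = line.split("|", 2)
    if len(parts) != 3:
        return None  # blank or malformed line: caller should skip it
    key, _, text = (part.strip() for part in parts)
    return key, text


sample = [
    "0x000f4a0f | | And unites the clans against Nilfgaard?\n",
    "\n",
    "0x000f4a99 | | Left | or right?\n",  # made-up line with an extra '|'
]
# filter(None, ...) drops the lines the parser rejected
script_dict = dict(filter(None, map(parse_script_line, sample)))
print(script_dict)
```

Whether malformed lines should be skipped silently or reported depends on how clean the real Script.txt is.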
Upvotes: 1
Reputation: 14226
Using pandas is another approach, since this looks like a typical join problem.
import pandas as pd
df = pd.read_csv('speech.txt', header=None, names=['name'])
df1 = pd.read_csv('script.txt', sep='|', header=None, names=['name', 'blank', 'description'])
df1['name'] = df1.name.str.strip() + '.wav'
final = pd.merge(df, df1, how='left', left_on='name', right_on='name')
final['name'] = 'C:/Speech/' + final['name']
print(final)
name blank description
0 C:/Speech/0x000f4a03.wav Thinking long-term, then. Think she'll succeed?
1 C:/Speech/0x000f4a07.wav Son's King of Skellige. Congratulations to you.
2 C:/Speech/0x000f4a0f.wav And unites the clans against Nilfgaard?
Then it is just a matter of selecting the columns you want and saving them out.
final = final[['name', 'description']]
final.to_csv('some_name.csv', index=False, sep='|')
Upvotes: 1
Reputation: 401
You can load the lines into lists with the readlines() method and then iterate over them. This avoids the problem that Kuldeep Singh Sidhu correctly identified of the pointer reaching the end of the file.
f1=open(r'C:/Speech.txt',"r", encoding='utf8')
f2=open(r'C:/script.txt',"r", encoding='utf8')
lines1 = f1.readlines()
lines2 = f2.readlines()
f1.close()
f2.close()
with open("output.txt","w") as outfile:
    for line1 in lines1:
        for line2 in lines2:
            if line1[0:10]==line2[0:10]:
                # write() takes a single string, so the newline is concatenated
                outfile.write('C:/Speech/' + line2[0:10] + '.wav' + '|' + line2[26:-1] + "\n")
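The end-of-file behaviour mentioned above can be seen in a small self-contained sketch. It uses io.StringIO as a stand-in for a real file (an assumption, so the example needs no files on disk); a file object returned by open() is exhausted in the same way when iterated.

```python
import io

# Stand-in for Script.txt; a real file object is exhausted the same way
script = io.StringIO("line a\nline b\n")

first_pass = [line.strip() for line in script]
second_pass = [line.strip() for line in script]  # already at end of file

print(first_pass)   # ['line a', 'line b']
print(second_pass)  # []

# Rewinding with seek(0) makes the contents readable again
script.seek(0)
third_pass = [line.strip() for line in script]
print(third_pass)   # ['line a', 'line b']
```

This is why the nested loop in the question matches only the first line of Speech.txt: the inner loop has nothing left to read on later passes. Reading the lines into a list, as above, or into a dictionary avoids rewinding altogether.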
Upvotes: 1