Izuchi

Reputation: 11

Compare two text files and write matching values to a text file

I have two text files: Speech.txt and Script.txt. Speech.txt contains a list of audio filenames and Script.txt contains the corresponding transcripts. Script.txt holds transcripts for all characters and items, but I only want the transcripts for one specific character. I want to write a Python script that matches each filename to its transcript and produces a text file containing the file path, filename, extension and transcript, separated by |.

Sample of Speech.txt:

0x000f4a03.wav
0x000f4a07.wav
0x000f4a0f.wav

Sample of Script.txt:

0x000f4a0f |            | And unites the clans against Nilfgaard?
0x000f4a11 |            | Of course. He's already decreed new longships be built.
0x000f4a03 |            | Thinking long-term, then. Think she'll succeed?
0x000f4a05 |            | She's got a powerful ally. In me.
0x000f4a07 |            | Son's King of Skellige. Congratulations to you.

Expected Output:

C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?

Code (work in progress):

f1=open(r'C:/Speech.txt',"r", encoding='utf8')
f2=open(r'C:/script.txt',"r", encoding='utf8')
for line1 in f1:
    for line2 in f2:
        if line1[0:10]==line2[0:10]:
              print('C:/Speech/' + line2[0:10] + '.wav' + '|' + line2[26:-1])              
f1.close()
f2.close()

The above code seems to only work for the first line in Speech.txt and then stops. I want it to run through the entire file i.e. line 2, line 3 ...etc. I also haven't figured out how to output the results into a text file. I can only print out the results at the moment. Any help would be appreciated!

EDIT Links to Script.txt and Speech.txt.

Upvotes: 1

Views: 1360

Answers (4)

Cristian Ramon-Cortes

Reputation: 1888

For each line of the Speech.txt file, you need to check whether it exists in the Script.txt file. Assuming the content of Script.txt fits in memory, you should load it once into a dictionary so that the file is not re-read for every line.

Once the content of Script.txt is loaded, you simply process each line of the Speech.txt, search it in the dictionary and print it when required.

Next, I provide the code. Notice that:

  • I have added debug information. You can hide it by executing python -O script.py
  • I use os.path.splitext(var)[0] to remove the extension from the filename
  • I strip every processed line to get rid of spaces/line breaks.

Code:

#!/usr/bin/python

# -*- coding: utf-8 -*-

# For better print formatting
from __future__ import print_function

# Imports
import sys
import os


#
# HELPER METHODS
#
def load_script_file(script_file_path):
    # Parse each line of the script file and load to a dictionary
    d = {}
    with open(script_file_path, "r") as f:
        for transcript_info in f:
            if __debug__:
                print("Loading line: " + str(transcript_info))
            speech_filename, _, transcription = transcript_info.split("|")
            speech_filename = speech_filename.strip()
            transcription = transcription.strip()
            d[speech_filename] = transcription

    if __debug__:
        print("Loaded values: " + str(d))
    return d


#
# MAIN METHODS
#

def main(speech_file_path, script_file_path, output_file):
    # Load the script data into a dictionary
    speech_to_transcript = load_script_file(script_file_path)

    # Check each speech entry
    with open(speech_file_path, "r") as f:
        for speech_audio_file in f:
            speech_audio_file = speech_audio_file.strip()
            if __debug__:
                print()
                print("Checking speech file: " + str(speech_audio_file))

            # Remove extension
            speech_code = os.path.splitext(speech_audio_file)[0]
            if __debug__:
                print(" + Obtained filename: " + speech_code)

            # Find entry in transcript
            if speech_code in speech_to_transcript.keys():
                if __debug__:
                    print(" + Filename registered. Loading transcript")
                transcript = speech_to_transcript[speech_code]
                if __debug__:
                    print(" + Transcript: " + str(transcript))

                # Print information
                output_line = "C:/Speech/" + speech_audio_file + "|" + transcript
                if output_file is None:
                    print(output_line)
                else:
                    with open(output_file, 'a') as fw:
                        fw.write(output_line + "\n")
            else:
                if __debug__:
                    print(" + Filename not registered")


#
# ENTRY POINT
#
if __name__ == '__main__':
    # Parse arguments
    args = sys.argv[1:]
    speech = str(args[0])
    script = str(args[1])
    if len(args) == 3:
        output = str(args[2])
    else:
        output = None

    # Log arguments if required
    if __debug__:
        print("Running with:")
        print(" - SPEECH FILE = " + str(speech))
        print(" - SCRIPT FILE = " + str(script))
        print(" - OUTPUT FILE = " + str(output))
        print()

    # Execute main
    main(speech, script, output)

Debug Output:

$ python speech_transcript.py ./Speech.txt ./Script.txt
Running with:
 - SPEECH FILE = ./Speech.txt
 - SCRIPT FILE = ./Script.txt

Loaded values: {'0x000f4a03': "Thinking long-term, then. Think she'll succeed?", '0x000f4a11': "Of course. He's already decreed new longships be built.", '0x000f4a05': "She's got a powerful ally. In me.", '0x000f4a07': "Son's King of Skellige. Congratulations to you.", '0x000f4a0f': 'And unites the clans against Nilfgaard?'}

Checking speech file: 0x000f4a03.wav
 + Obtained filename: 0x000f4a03
 + Filename registered. Loading transcript
 + Transcript: Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?

Checking speech file: 0x000f4a07.wav
 + Obtained filename: 0x000f4a07
 + Filename registered. Loading transcript
 + Transcript: Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.

Checking speech file: 0x000f4a0f.wav
 + Obtained filename: 0x000f4a0f
 + Filename registered. Loading transcript
 + Transcript: And unites the clans against Nilfgaard?
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?

Output:

$ python -O speech_transcript.py ./Speech.txt ./Script.txt 
C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?

Output writing to file:

$ python -O speech_transcript.py ./Speech.txt ./Script.txt ./output.txt
$ more output.txt 
C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?
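One caveat about the unpacking in load_script_file: assigning transcript_info.split("|") to exactly three names raises ValueError on a blank line, or on a transcript that itself contains a |. Below is a minimal sketch of a more tolerant parser, assuming the three-column layout shown in the question. Note that parse_script_line is a hypothetical helper name used only for illustration; it is not part of the code above.

```python
def parse_script_line(line):
    """Return (speech_code, transcript), or None if the line is not a valid entry.

    Assumes the 'code | middle column | transcript' layout from the question.
    """
    # Split on the first two separators only, so a '|' inside the transcript survives
    parts = line.split("|", 2)
    if len(parts) != 3:
        return None  # blank or malformed line
    speech_code, _, transcript = (part.strip() for part in parts)
    return speech_code, transcript


print(parse_script_line("0x000f4a05 |            | She's got a powerful ally. In me.\n"))
print(parse_script_line("\n"))  # None
```

Inside the loading loop, a None result could simply be skipped instead of stopping the whole run.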

Upvotes: 1

RoadRunner

Reputation: 26325

I would read the Script.txt contents into a dictionary, then look lines up in this dictionary as you iterate over Speech.txt, printing only the ones that exist. This avoids iterating over the file multiple times, which could be quite slow if you have large files.

Demo:

from pathlib import Path

with open("Speech.txt") as speech_file, open("Script.txt") as script_file:
    script_dict = {}
    for line in script_file:
        key, _, text = map(str.strip, line.split("|"))
        script_dict[key] = text

    for line in map(str.strip, speech_file):
        filename = Path(line).stem
        if filename in script_dict:
            print(f"C:\\Speech\\{line}|{script_dict[filename]}")

Output:

C:\Speech\0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:\Speech\0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:\Speech\0x000f4a0f.wav|And unites the clans against Nilfgaard?

It's also much easier to use With Statement Context Managers to open your files: you don't need to call .close() yourself, because the context manager handles that for you.
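A quick way to see the automatic close, as a small sketch that writes to a temporary directory so it can run anywhere (the path is only an illustration):

```python
import os
import tempfile

# Temporary location so the sketch does not touch any real files
path = os.path.join(tempfile.mkdtemp(), "output.txt")

with open(path, "w", encoding="utf8") as output_file:
    output_file.write("C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?\n")
    print(output_file.closed)   # False: still open inside the block

print(output_file.closed)       # True: closed automatically on leaving the block
```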

I also used pathlib.PurePath.stem to get the filename from your .wav files. I find this easier to use than the os.path.basename and os.path.splitext functions, although this is personal preference and all of them will work.
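For comparison, here is a short sketch of the two approaches on one of the sample filenames. They agree on a bare filename, but differ once a directory is included, which is why os.path.basename is needed alongside os.path.splitext in that case:

```python
import os.path
from pathlib import PurePath

name = "0x000f4a03.wav"

# Both calls drop the final extension from a bare filename
stem_pathlib = PurePath(name).stem
stem_os_path = os.path.splitext(name)[0]
print(stem_pathlib)   # 0x000f4a03
print(stem_os_path)   # 0x000f4a03

# With a directory in front, .stem drops it but splitext keeps it
with_dir = "Speech/0x000f4a03.wav"
print(PurePath(with_dir).stem)                            # 0x000f4a03
print(os.path.splitext(with_dir)[0])                      # Speech/0x000f4a03
print(os.path.splitext(os.path.basename(with_dir))[0])    # 0x000f4a03
```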

If we want to write the output to a text file, we can open another output file in write mode using mode="w":

from pathlib import Path

with open("Speech.txt") as speech_file, open("Script.txt") as script_file, open("output.txt", mode="w") as output_file:
    script_dict = {}
    for line in script_file:
        key, _, text = map(str.strip, line.split("|"))
        script_dict[key] = text

    for line in map(str.strip, speech_file):
        filename = Path(line).stem
        if filename in script_dict:
            output_file.write(f"C:\\Speech\\{line}|{script_dict[filename]}\n")

output.txt

C:\Speech\0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:\Speech\0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:\Speech\0x000f4a0f.wav|And unites the clans against Nilfgaard?

You can have a look at Reading and Writing Files in the documentation for more information on how to read and write files in Python.

Upvotes: 1

gold_cy

Reputation: 14226

Using pandas is another approach as well since this seems like your typical join problem.

import pandas as pd

df = pd.read_csv('speech.txt', header=None, names=['name'])
df1 = pd.read_csv('script.txt', sep='|', header=None, names=['name', 'blank', 'description'])

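# Strip the padding around the '|' separators, then add the extension so keys match speech.txt
df1['name'] = df1.name.str.strip() + '.wav'
df1['description'] = df1.description.str.strip()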

final = pd.merge(df, df1, how='left', left_on='name', right_on='name')
final['name'] = 'C:/Speech/' + final['name']

print(final)

                       name         blank                                       description
0  C:/Speech/0x000f4a03.wav                 Thinking long-term, then. Think she'll succeed?
1  C:/Speech/0x000f4a07.wav                 Son's King of Skellige. Congratulations to you.
2  C:/Speech/0x000f4a0f.wav                         And unites the clans against Nilfgaard?

Then it is just a matter of selecting the columns you want and saving them out.

final = final[['name', 'description']]
final.to_csv('some_name.csv', index=False, sep='|')

Upvotes: 1

Patrick von Glehn

Reputation: 401

You can load the lines into lists with the readlines() method and then iterate over them. This avoids the problem, correctly identified by Kuldeep Singh Sidhu, of the file pointer reaching the end of the file.

f1=open(r'C:/Speech.txt',"r", encoding='utf8')
f2=open(r'C:/script.txt',"r", encoding='utf8')
lines1 = f1.readlines()
lines2 = f2.readlines()
f1.close()
f2.close()

with open("output.txt","w") as outfile:
    for line1 in lines1:
        for line2 in lines2:
            if line1[0:10]==line2[0:10]:
                # write() takes a single string, so the newline is concatenated
                outfile.write('C:/Speech/' + line2[0:10] + '.wav' + '|' + line2[26:-1] + "\n")
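The end-of-file behaviour mentioned above can be reproduced without any files on disk. In this sketch io.StringIO objects stand in for the two open files (an assumption made so the example is self-contained; they iterate line by line in the same way), and the transcripts are shortened placeholders:

```python
import io

# Stand-ins for the open Speech.txt and Script.txt files
speech = io.StringIO("0x000f4a03.wav\n0x000f4a07.wav\n")
script = io.StringIO("0x000f4a07 | | second\n0x000f4a03 | | first\n")

nested_matches = []
for line1 in speech:
    for line2 in script:          # consumed completely during the first outer pass
        if line1[0:10] == line2[0:10]:
            nested_matches.append(line1.strip())

print(nested_matches)             # ['0x000f4a03.wav']: the second name is never compared

# Reading the lines into a list first makes them reusable on every outer pass
speech.seek(0)
script.seek(0)
script_lines = script.readlines()
list_matches = [line1.strip() for line1 in speech
                for line2 in script_lines if line1[0:10] == line2[0:10]]
print(list_matches)               # ['0x000f4a03.wav', '0x000f4a07.wav']
```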

Upvotes: 1
