DazedFury

Reputation: 59

Writing to a file from multiple threads in the correct order

I'm trying to read in a file, translate it using a remote API endpoint, then write the result to a file.

It was really slow because each request takes 2-3 seconds, so I've opted to use threads to speed up the translation by hitting the endpoint multiple times in parallel (as recommended in their API docs).

However, I'm having trouble coming up with a way to write the translated lines in the correct order. Race conditions, I suppose. I think the issue is that I'm writing to a single file from multiple threads, so I would need a queue or something, but I have no idea how to approach it.

Main()

#Open File
for filename in os.listdir("files"):
    with open('translate/' + filename, 'w', encoding='UTF-8') as outFile:
        with open ('files/' + filename, 'r', encoding='UTF-8') as f:
            count = 0

            #Replace Each Line
            with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
                future_to_url = (executor.submit(findMatch, line, count) for line in f)
                for future in concurrent.futures.as_completed(future_to_url):
                    print(future)

FindMatch()

def findMatch(line, count):
    count = count + 1   #Keep track of lines for debugging
    #Check if match in line
    if(re.search(pattern1, line) != None):

        #Translate each match in line. Depends on choice
        for match in re.findall(pattern1, line):

            #Filter out matches with no Japanese
            if(re.search(pattern2, match) != None and '$' not in match):
                if(choice == '1'):
                    match = match.rstrip()
                    print('Translating: ' + str(count) + ': ' + match)
                    translatedMatch = translate(match)
                    line = re.sub(match, translatedMatch, line, 1)

                elif(choice == '2'):
                    match = match.rstrip()
                    print('Translating Line: ' + str(count))
                    line = translate(line)
                    break       #Don't want dupes

                else:
                    print('Bad Coder. Check your if statements')

        outFile.write(line)

    #Skip Line
    else:
        print('Skipping: ' + str(count))
        outFile.write(line)

Upvotes: 0

Views: 1317

Answers (2)

Aaron

Reputation: 1368

To write the lines in the correct order:

  1. Use for future in future_to_url to iterate over the futures in submission order.
  2. Use a list comprehension [executor.submit(...) for line in f] instead of a generator expression (executor.submit(...) for line in f), so that all lines are submitted to the executor at once. With a generator, each task is submitted only when the loop reaches it, so calling future.result() in the loop body would run the tasks one at a time instead of in parallel.
  3. Have findMatch() return the result rather than write to the output directly.

When future.result() is called, it returns the result immediately if it is available; otherwise it blocks until the result is ready.

import concurrent.futures
import os
import re


def main():
    # Open File
    for filename in os.listdir("files"):
        with open('translate/' + filename, 'w', encoding='UTF-8') as outFile:
            with open('files/' + filename, 'r', encoding='UTF-8') as f:
                count = 0

                # Replace Each Line
                with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:

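                    # Submit all lines up front; enumerate supplies the 0-based
                    # line index, which findMatch turns into a 1-based count
                    future_to_url = [executor.submit(findMatch, line, count) for count, line in enumerate(f)]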

                    # as_completed yields futures in whatever order they finish
                    # A plain for-loop over the list iterates them in submission order
                    for future in future_to_url:
                        print(future.result())
                        # Uncomment to actually write to the output
                        # outFile.write(future.result())


def findMatch(line, count):
    count = count + 1  # Keep track of lines for debugging
    # Check if match in line
    if (re.search(pattern1, line) != None):

        # Translate each match in line. Depends on choice
        for match in re.findall(pattern1, line):

            # Filter out matches with no Japanese
            if (re.search(pattern2, match) != None and '$' not in match):
                if (choice == '1'):
                    match = match.rstrip()
                    print('Translating: ' + str(count) + ': ' + match)
                    translatedMatch = translate(match)
                    # Literal replacement; re.sub would treat match as a regex pattern
                    line = line.replace(match, translatedMatch, 1)

                elif (choice == '2'):
                    match = match.rstrip()
                    print('Translating Line: ' + str(count))
                    line = translate(line)
                    break  # Don't want dupes

                else:
                    print('Bad Coder. Check your if statements')

        return line
    # Skip Line
    else:
        print('Skipping: ' + str(count))

        return line
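
The ordering behaviour can be checked in isolation with a minimal sketch like the one below. Note that fake_translate and its delays are made up for illustration; it only stands in for the real translate() call so that tasks finish out of submission order.

```python
import concurrent.futures
import time


def fake_translate(line):
    # Hypothetical stand-in for the remote API call: shorter lines
    # finish sooner, so tasks complete out of submission order.
    time.sleep(0.01 * len(line))
    return line.upper()


def translate_in_order(lines, max_workers=5):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # The list comprehension submits every line up front.
        futures = [executor.submit(fake_translate, line) for line in lines]
        # Iterating the list (not as_completed) keeps submission order;
        # result() blocks until that particular future is done.
        return [future.result() for future in futures]


lines = ["a long first line\n", "short\n", "mid length\n"]
translated = translate_in_order(lines)
```

Even though "short" finishes first here, translated comes back in the original line order. executor.map(fake_translate, lines) should behave the same way, since it also yields results in the order of its input.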

Upvotes: 2

Frank Yellin

Reputation: 11332

I think the simplest solution would be for findMatch to take a string as an argument and return its translation as a string. Your main program would then be responsible for sorting all the translations and printing them out in order.

Attempting to synchronize multiple threads all writing to a single file is a big mess.
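
One way this could look, as a rough sketch: tag each line with its index, collect the translations as they finish, then sort by index before writing. Here fake_translate and translate_and_sort are hypothetical names, and the "[EN] " prefix just stands in for a real translation.

```python
import concurrent.futures
import time


def fake_translate(line):
    # Hypothetical stand-in for the remote API call, with a varying delay
    # so that results arrive in an unpredictable order.
    time.sleep(0.005 * (len(line) % 5))
    return "[EN] " + line


def translate_and_sort(lines, max_workers=5):
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which input index each future belongs to.
        future_to_index = {
            executor.submit(fake_translate, line): index
            for index, line in enumerate(lines)
        }
        # Collect (index, translation) pairs in completion order.
        for future in concurrent.futures.as_completed(future_to_index):
            results.append((future_to_index[future], future.result()))
    # Only the main thread sorts and returns, so no file is shared between threads.
    results.sort(key=lambda pair: pair[0])
    return [text for _, text in results]


ordered = translate_and_sort(["one\n", "two\n", "three\n"])
```

The main program can then write ordered to the output file in a single outFile.writelines(ordered) call.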

Upvotes: 2
