Reputation: 6938
From a Python script, I need to call a PL->EN translation service. The translation requires 3 steps: tokenization, translation, and detokenization.
On Linux, I can do this with 3 processes by running the following commands in order:
/home/nlp/opt/moses/scripts/tokenizer/tokenizer.perl -l pl < path_to_input.txt > path_to_output.tok.txt
/home/nlp/opt/moses/bin/moses -f /home/nlp/Downloads/TED/tuning/moses.tuned.ini.1 -drop-unknown -input-file path_to_output.tok.txt -th 8 > path_to_output.trans.txt
/home/nlp/opt/moses/scripts/tokenizer/detokenizer.perl -l en < path_to_output.trans.txt > path_to_output.final.txt
This translates the file path_to_input.txt and writes the result to path_to_output.final.txt.
I have made the following script for combining the 3 processes:
import shlex
import subprocess
from subprocess import STDOUT,PIPE
import os
import socket

class Translator:
    @staticmethod
    def pl_to_en(input_file, output_file):
        # Tokenize
        print("Tokenization started")
        with open("tokenized.txt", "w+") as tokenizer_output:
            with open(input_file) as tokenizer_input:
                cmd = "/home/nlp/opt/moses/scripts/tokenizer/tokenizer.perl -l pl"
                args = shlex.split(cmd)
                p = subprocess.Popen(args, stdin=tokenizer_input, stdout=tokenizer_output)
                p.wait()
        print("Tokenization finished")
        # Translate
        print("Translation started")
        with open("translated.txt", "w+") as translator_output:
            cmd = "/home/nlp/opt/moses/bin/moses -f /home/nlp/Downloads/TED/tuning/moses.tuned.ini.1 -drop-unknown -input-file tokenized.txt -th 8"
            args = shlex.split(cmd)
            p = subprocess.Popen(args, stdout=translator_output)
            p.wait()
        print("Translation finished")
        # Detokenize
        print("Detokenization started")
        with open("translated.txt") as detokenizer_input:
            with open("detokenized.txt", "w+") as detokenizer_output:
                cmd = "/home/nlp/opt/moses/scripts/tokenizer/detokenizer.perl -l en"
                args = shlex.split(cmd)
                p = subprocess.Popen(args, stdin=detokenizer_input, stdout=detokenizer_output)
                p.wait()
        print("Detokenization finished")

translator = Translator()
translator.pl_to_en("some_input_file.txt", "some_output_file.txt")
But only the tokenization part works. The translator just outputs an empty file, translated.txt. Looking at the output in the terminal, it seems that the translator loads tokenized.txt correctly and does perform a translation. The problem is just how I collect the output from that process.
Upvotes: 0
Views: 117
Reputation: 5875
I would try something like the following: send the output of the translator process to a pipe, and make that pipe the input of the detokenizer instead of going through files.
import shlex
import subprocess
from subprocess import STDOUT,PIPE
import os
import socket

class Translator:
    @staticmethod
    def pl_to_en(input_file, output_file):
        # Tokenize
        print("Tokenization started")
        with open("tokenized.txt", "w+") as tokenizer_output:
            with open(input_file) as tokenizer_input:
                cmd = "/home/nlp/opt/moses/scripts/tokenizer/tokenizer.perl -l pl"
                args = shlex.split(cmd)
                p = subprocess.Popen(args, stdin=tokenizer_input, stdout=tokenizer_output)
                p.wait()
        print("Tokenization finished")
        # Translate: write the translator's stdout to a pipe instead of a file
        print("Translation started")
        cmd = "/home/nlp/opt/moses/bin/moses -f /home/nlp/Downloads/TED/tuning/moses.tuned.ini.1 -drop-unknown -input-file tokenized.txt -th 8"
        args = shlex.split(cmd)
        translate_p = subprocess.Popen(args, stdout=subprocess.PIPE)
        # Detokenize: read directly from the translator's pipe
        print("Detokenization started")
        with open("detokenized.txt", "w+") as detokenizer_output:
            cmd = "/home/nlp/opt/moses/scripts/tokenizer/detokenizer.perl -l en"
            args = shlex.split(cmd)
            detokenizer_p = subprocess.Popen(args, stdin=translate_p.stdout, stdout=detokenizer_output)
            # Close our copy of the pipe so the translator sees a broken pipe
            # if the detokenizer exits early
            translate_p.stdout.close()
            # Wait only once the reader is running: calling wait() on the
            # translator first can deadlock when its output fills the pipe buffer
            detokenizer_p.wait()
            translate_p.wait()
        print("Translation finished")
        print("Detokenization finished")

translator = Translator()
translator.pl_to_en("some_input_file.txt", "some_output_file.txt")
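The same idea can be taken one step further by connecting every stage with pipes, so no intermediate files are needed at all. The following is only a sketch: `run_pipeline` is a hypothetical helper name of mine, and the two stages in the demo are small stand-in Python one-liners rather than the Moses commands, so that it runs anywhere. For the real pipeline you would pass the `shlex.split(...)` results for the tokenizer, moses (without `-input-file`, assuming it then reads from stdin) and the detokenizer.

```python
import os
import subprocess
import sys
import tempfile


def run_pipeline(commands, input_path, output_path):
    """Chain commands so that each one reads the previous one's stdout.

    `commands` is a list of argument lists. Returns the exit codes in order.
    """
    procs = []
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        prev_stdout = src
        for i, args in enumerate(commands):
            is_last = i == len(commands) - 1
            proc = subprocess.Popen(
                args,
                stdin=prev_stdout,
                stdout=dst if is_last else subprocess.PIPE,
            )
            # Close the parent's copy of the previous pipe so an upstream
            # process sees a broken pipe if this stage exits early.
            if procs:
                prev_stdout.close()
            prev_stdout = proc.stdout
            procs.append(proc)
        # Wait only after every stage has been started; waiting on a stage
        # whose pipe has no reader can block once the pipe buffer fills.
        return [p.wait() for p in procs]


if __name__ == "__main__":
    # Stand-in stages (assumptions for illustration, not the Moses tools).
    upper = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    underscore = [sys.executable, "-c",
                  "import sys; sys.stdout.write(sys.stdin.read().replace(' ', '_'))"]
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "w") as f:
            f.write("ala ma kota\n")
        print(run_pipeline([upper, underscore], src, dst))
        with open(dst) as f:
            print(f.read(), end="")
```

Checking the returned exit codes is worthwhile here, since a non-zero code from one stage is often the first hint as to why a later file ends up empty.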
Upvotes: 1