Reputation: 663
Is there a module for handling TMX (Translation Memory eXchange) files in Python? If not, what would be another way to do it?
As it stands, I have a giant 2 GB file with French-English subtitles. Is it even possible to handle a file that size, or would I have to break it down?
Upvotes: 10
Views: 10674
Reputation: 122022
Here's a script that converts a TMX file to a pandas DataFrame. Note that it reads the whole file into memory, so it may not suit very large files:
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup

def tmx2df(tmxfile):
    # Pick your poison for parsing XML.
    with open(tmxfile) as fin:
        content = fin.read()
    bsoup = BeautifulSoup(content, 'lxml')

    # Actual TMX extraction.
    lol = []  # Keep a list of the rows to populate.
    for tu in tqdm(bsoup.find_all('tu')):
        # Parse metadata from tu
        metadata = tu.attrs
        # Parse prop
        properties = {prop.attrs['type']: prop.text for prop in tu.find_all('prop')}
        # Parse seg
        segments = {}
        # The order of the languages might not be consistent,
        # so keep them in some dict and unstructured first.
        for tuv in tu.find_all('tuv'):
            segment = ' '.join([seg.text for seg in tuv.find_all('seg')])
            segments[tuv.attrs['xml:lang']] = segment
        lol.append({'metadata': metadata, 'properties': properties, 'segments': segments})

    # Put the list of rows into a dataframe.
    df = pd.DataFrame(lol)
    # Expand the segments dict into one column per language.
    # See https://stackoverflow.com/a/38231651
    return pd.concat([df.drop(['segments'], axis=1), df['segments'].apply(pd.Series)], axis=1)
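For a file as large as the 2 GB one in the question, loading everything at once may not be practical. Below is a minimal sketch of a streaming alternative using only the standard library's `xml.etree.ElementTree.iterparse`. The function name `iter_tmx_pairs` and its parameters are made up for illustration, and the sketch assumes a plain TMX layout (`tu` elements directly under `body`, no default XML namespace on the tags). It has not been tried on a file of that size.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_tmx_pairs(path, src_lang, tgt_lang):
    """Yield (source, target) text pairs, handling one <tu> at a time.

    Hypothetical helper: assumes <tu> elements sit directly under <body>
    and that the tags carry no default namespace.
    """
    body = None
    for event, elem in ET.iterparse(path, events=("start", "end")):
        if event == "start":
            # Remember <body> so finished <tu> children can be dropped later.
            if elem.tag == "body":
                body = elem
            continue
        if elem.tag != "tu":
            continue

        segments = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; fall back to a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                # itertext() also picks up text around inline markup.
                segments[lang] = "".join(seg.itertext())

        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]

        # Release the finished unit so the in-memory tree does not keep growing.
        if body is not None:
            body.clear()
        else:
            elem.clear()
```

Because this is a generator, pairs can be consumed one by one, for example `for src, tgt in iter_tmx_pairs("sample.tmx", "en", "ar"): ...`, and written straight to a CSV or TSV file instead of being collected in a list.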
Upvotes: 0
Reputation: 12992
As @hurrial said, you can use translate-toolkit.
The toolkit is available on PyPI. To install it with pip, run:
pip install translate-toolkit
Assume that you have the following simple sample.tmx file:
<tmx version="1.4">
  <header
    creationtool="XYZTool" creationtoolversion="1.01-023"
    datatype="PlainText" segtype="sentence"
    adminlang="en-us" srclang="en"
    o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>Hello world!</seg>
      </tuv>
      <tuv xml:lang="ar">
        <seg>اهلا بالعالم!</seg>
      </tuv>
    </tu>
  </body>
</tmx>
You can parse this simple file like so:
>>> from translate.storage.tmx import tmxfile
>>>
>>> with open("sample.tmx", 'rb') as fin:
...     tmx_file = tmxfile(fin, 'en', 'ar')
>>>
>>> for node in tmx_file.unit_iter():
...     print(node.source, node.target)
Hello world! اهلا بالعالم!
For more information, see the official translate-toolkit documentation.
Upvotes: 7
Reputation: 514
You may check the following links:
Cheers,
Upvotes: 2