Chameleon

Reputation: 10128

Is there any Python package which allows parsing .po files (messages, including context comments)?

I need to merge, update, and delete messages in .po files, so I need a Python package that can fully parse .po files, including messages, plurals, locations, context, and comments.

I want to write a simple tool that checks the differences between two files. I could also use an existing GUI tool, but I am not sure whether there is one that can add new translations or remove unused ones.

I searched for articles but did not find out how to do this. Please recommend a Python package that fully parses .po files (another language would also do), or a tool for this important task of keeping translations in good shape.

Upvotes: 4

Views: 2159

Answers (3)

carl_kibler

Reputation: 79

The polib package is very good. It parses the file and presents several ways to access the data, including an iterator to loop through the msgid, msgstr pairs to do whatever you need. Here is the Quick Start documentation.

It can also parse a .mo file if the .po isn't available, handle obsolete message strings specially, iterate over only the translated strings, and it offers other nice features.

Upvotes: 6

Roland Smith

Reputation: 43495

You don't need fancy tools to read .po files; they are plain text files which basically contain message/translation pairs:

#: buttons.c:425
msgid "Extra"
msgstr "Thêm"

#: buttons.c:433
msgid "Help"
msgstr "Trợ giúp"

For a simple tool to compare them, I would suggest using diff -u.

There is a binary format with the .mo extension. You can use the msgunfmt program from the gettext-tools package to convert them back into plain text.

Extracting id/translation pairs from .po files is not difficult:

In [1]: po = '''#: buttons.c:425
   ...: msgid "Extra"
   ...: msgstr "Thêm"
   ...: 
   ...: #: buttons.c:433
   ...: msgid "Help"
   ...: msgstr "Trợ giúp"
   ...: 
   ...: '''

In [2]: import re

In [3]: re.findall('^msgid \"(.*)\"', po, re.MULTILINE)
Out[3]: ['Extra', 'Help']

In [4]: re.findall('^msgstr \"(.*)\"', po, re.MULTILINE)
Out[4]: ['Th\xc3\xaam', 'Tr\xe1\xbb\xa3 gi\xc3\xbap']

In [5]: zip(re.findall('^msgid \"(.*)\"[^\"]*', po, re.MULTILINE), re.findall('^msgstr \"(.*)\"[^\"]*', po, re.MULTILINE))
Out[5]: [('Extra', 'Th\xc3\xaam'), ('Help', 'Tr\xe1\xbb\xa3 gi\xc3\xbap')]

I'm using ^ together with re.MULTILINE so that only lines starting with msgid or msgstr match; this keeps commented-out messages from turning up here. As a sanity check, make sure that the lists containing the message-ids and the message strings are of equal length.
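The extraction and the sanity check can be wrapped in one function. This is a minimal sketch for Python 3; the name extract_pairs and the sample text (including the obsolete "#~" entry) are invented for the example. Like the regular expressions above, it only handles single-line msgid/msgstr values and ignores plurals, msgctxt, and strings continued over several lines.

```python
import re

# Hypothetical sample, including one obsolete (commented-out) entry.
PO_TEXT = '''#: buttons.c:425
msgid "Extra"
msgstr "Thêm"

#: buttons.c:433
msgid "Help"
msgstr "Trợ giúp"

#~ msgid "Old"
#~ msgstr "Cũ"
'''


def extract_pairs(po_text):
    """Return a list of (msgid, msgstr) tuples from single-line entries."""
    # ^ with re.MULTILINE only matches at the start of a line, so the
    # "#~ msgid" lines of obsolete entries are skipped.
    ids = re.findall(r'^msgid "(.*)"', po_text, re.MULTILINE)
    strs = re.findall(r'^msgstr "(.*)"', po_text, re.MULTILINE)
    # Sanity check: every message-id needs exactly one translation line.
    if len(ids) != len(strs):
        raise ValueError("found %d msgid but %d msgstr lines"
                         % (len(ids), len(strs)))
    return list(zip(ids, strs))


pairs = extract_pairs(PO_TEXT)
print(pairs)
```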

Edit: You have a valid point regarding the random ordering and using diff. But you could use the code above to make lists of (message-id, translation) tuples for both the old and the new .po file. If you sort those lists by message-id, you can use difflib.unified_diff to print the differences.

For example:

In [1]: import re, itertools, difflib

#I've used cpaste to input two pieces of a .po file, the latter with some changes

In [4]: orig_po
Out[4]: '#: mixedgauge.c:64\nmsgid "Passed"\nmsgstr "\xc4\x90\xe1\xbb\x97"\n\n#: mixedgauge.c:67\nmsgid "Completed"\nmsgstr "Ho\xc3\xa0n to\xc3\xa0n"\n\n#: mixedgauge.c:70\nmsgid "Checked"\nmsgstr "\xc4\x90\xc3\xa3 ki\xe1\xbb\x83m tra"\n\n#: mixedgauge.c:73\nmsgid "Done"\nmsgstr "Ho\xc3\xa0n t\xe1\xba\xa5t"\n\n#: mixedgauge.c:76\nmsgid "Skipped"\nmsgstr "B\xe1\xbb\x8b b\xe1\xbb\x8f qua"\n\n#: mixedgauge.c:79\nmsgid "In Progress"\nmsgstr "\xc4\x90ang ch\xe1\xba\xa1y"\n\n#: mixedgauge.c:85\nmsgid "N/A"\nmsgstr "Kh\xc3\xb4ng c\xc3\xb3"\n\n#: mixedgauge.c:193\nmsgid "Overall Progress"\nmsgstr "To\xc3\xa0n ti\xe1\xba\xbfn h\xc3\xa0nh"\n'

In [5]: changed_po
Out[5]: '#: mixedgauge.c:64\nmsgid "Passed"\nmsgstr "\xc4\x90\xe1\xbb\x97"\n\n#: mixedgauge.c:193\nmsgid "Overall Progres"\nmsgstr "To\xc3\xa0n ti\xe1\xba\xbfn h\xc3\xa0nh"\n\n#: mixedgauge.c:67\nmsgid "Completed"\nmsgstr "Ho\xc3\xa0na to\xc3\xa0n"\n\n#: mixedgauge.c:76\nmsgid "Skipped"\nmsgstr "B\xe1\xbb\x8b b\xe1\xbb\x8f qua"\n\n#: mixedgauge.c:79\nmsgid "In Progress"\nmsgstr "\xc4\x90ang ch\xe1\xba\xa1y"\n\n#: mixedgauge.c:85\nmsgid "N/A"\nmsgstr "Kh\xc3\xb4ng c\xc3\xb3e"\n\n#: mixedgauge.c:70\nmsgid "Checked"\nmsgstr "\xc4\x90\xc3\xa3 ki\xe1\xbb\x83m tra"\n\n#: mixedgauge.c:73\nmsgid "Done"\nmsgstr "Ho\xc3\xa0n t\xe1\xba\xa5t"\n'

# Making a list of tuples

In [6]: orig_list = zip(re.findall('^(msgid \".*\")', orig_po, re.MULTILINE), re.findall('^(msgstr \".*\")', orig_po, re.MULTILINE))

In [7]: changed_list = zip(re.findall('^(msgid \".*\")', changed_po, re.MULTILINE), re.findall('^(msgstr \".*\")', changed_po, re.MULTILINE))

# Sort them by the message-id

In [8]: orig_list.sort(key=lambda t: t[0])

In [9]: changed_list.sort(key=lambda t: t[0])

# Now flatten the list

In [10]: orig_string_list = [i for i in itertools.chain(*orig_list)]

In [11]: changed_string_list = [i for i in itertools.chain(*changed_list)]

In [12]: orig_list[0:3]
Out[12]: [('msgid "Checked"', 'msgstr "\xc4\x90\xc3\xa3 ki\xe1\xbb\x83m tra"'), ('msgid "Completed"', 'msgstr "Ho\xc3\xa0n to\xc3\xa0n"'), ('msgid "Done"', 'msgstr "Ho\xc3\xa0n t\xe1\xba\xa5t"')]

In [13]: orig_string_list[0:6]
Out[13]: ['msgid "Checked"', 'msgstr "\xc4\x90\xc3\xa3 ki\xe1\xbb\x83m tra"', 'msgid "Completed"', 'msgstr "Ho\xc3\xa0n to\xc3\xa0n"', 'msgid "Done"', 'msgstr "Ho\xc3\xa0n t\xe1\xba\xa5t"']

# print the diff

In [14]: for l in difflib.unified_diff(orig_string_list, changed_string_list, fromfile='original', tofile='changed'):
   ....:     print l
   ....:     
--- original

+++ changed

@@ -1,14 +1,14 @@

 msgid "Checked"
 msgstr "Đã kiểm tra"
 msgid "Completed"
-msgstr "Hoàn toàn"
+msgstr "Hoàna toàn"
 msgid "Done"
 msgstr "Hoàn tất"
 msgid "In Progress"
 msgstr "Đang chạy"
 msgid "N/A"
-msgstr "Không có"
-msgid "Overall Progress"
+msgstr "Không cóe"
+msgid "Overall Progres"
 msgstr "Toàn tiến hành"
 msgid "Passed"
 msgstr "Đỗ"
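The session above looks like Python 2 (print statement, zip returning a list that can be sorted in place). Under Python 3 the same idea could be sketched as follows; the function names sorted_lines and po_diff and the two small sample strings are made up for this example, and lineterm='' is passed so the header lines print without the extra blank lines seen above.

```python
import difflib
import re

# Hypothetical samples: same messages in a different order, one changed.
ORIG_PO = '''#: buttons.c:433
msgid "Help"
msgstr "Trợ giúp"

#: buttons.c:425
msgid "Extra"
msgstr "Thêm"
'''

CHANGED_PO = '''#: buttons.c:425
msgid "Extra"
msgstr "Thêm"

#: buttons.c:433
msgid "Help"
msgstr "Giúp"
'''


def sorted_lines(po_text):
    """Return msgid/msgstr lines as a flat list, sorted by message-id."""
    ids = re.findall(r'^(msgid ".*")', po_text, re.MULTILINE)
    strs = re.findall(r'^(msgstr ".*")', po_text, re.MULTILINE)
    # zip() is an iterator in Python 3, so use sorted() instead of .sort().
    pairs = sorted(zip(ids, strs), key=lambda pair: pair[0])
    return [line for pair in pairs for line in pair]


def po_diff(orig_text, changed_text):
    """Return unified-diff lines that ignore the order of the entries."""
    return list(difflib.unified_diff(
        sorted_lines(orig_text), sorted_lines(changed_text),
        fromfile='original', tofile='changed', lineterm=''))


diff = po_diff(ORIG_PO, CHANGED_PO)
for line in diff:
    print(line)
```

Because both lists are sorted before diffing, two files that differ only in entry order produce an empty diff.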

Upvotes: 2

wRAR

Reputation: 25693

Try the babel module. It includes a .po parser in babel.messages.catalog and babel.messages.pofile, among other things.

Upvotes: 2
