Reputation: 884
I have two XML files that I'm trying to merge. I looked at other previous questions, but I don't feel like I can solve my problem from reading those. What I think makes my situation unique is that I have to find elements by attribute value and then merge to the opposite file.
I have two files. One is an English translation catalog and the second is a Japanese translation catalog. Pleas see below.
In the code below you'll see the XML has three elements which I will be merging children on - MessageCatalogueEntry, MessageCatalogueFormEntry, and MessageCatalogueFormItemEntry. I have hundreds of files and each file has thousands of lines. There may be more elements than the three I just listed, but I know for sure that all the elements have a "key" attribute.
My plan:
key_values = [321, 260, 320]
key=321
.key=321
from File 1.key=321
and add the child element I previously grabbed from File 1.key_values
list.File 1:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue []>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="AddKMRichSearchEngineAdmin_AutoTranslationCatalogue" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/>
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="Manage Recognised Phrases" locale="" mnemonic="" scriptText=""/>
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
File 2:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue[]>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="ja" message="アクティブ" />
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定" />
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="認識されたフレーズを管理" locale="ja" mnemonic="" scriptText="" />
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
Output:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue []>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="AddKMRichSearchEngineAdmin_AutoTranslationCatalogue" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
<MessageCatalogueEntry_loc locale="ja" message="アクティブ" />
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/>
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定" />
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="Manage Recognised Phrases" locale="" mnemonic="" scriptText=""/>
<MessageCatalogueFormItemEntry_loc hintText="" label="認識されたフレーズを管理" locale="ja" mnemonic="" scriptText="" />
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
I'm having trouble just even grabbing elements, nevermind grabbing them by key value. For example, I've been playing with the elementtree library and I wrote this code hoping to get just the MessageCatalogueEntry but I'm only getting their children:
from xml.etree import ElementTree as et
tree_japanese = et.parse('C:\\blah\\blah\\blah\\AddKMRichSearchEngineAdmin_AutoTranslationCatalogue_JA.xml')
root_japanese = tree_japanese.getroot()
MC_japanese = root_japanese.findall("MessageCatalogue")
for x in MC_japanese:
messageCatalogueEntry = x.findall("MessageCatalogueEntry")
for m in messageCatalogueEntry:
print et.tostring(m[0], encoding='utf8')
tree_english = et.parse('C:\\blah\\blah\\blah\\AddKMRichSearchEngineAdmin\\AddKMRichSearchEngineAdmin_AutoTranslationCatalogue.xml')
root_english = tree_english.getroot()
MC_english = root_english.findall("MessageCatalogue")
for x in MC_english:
messageCatalogueEntry = x.findall("MessageCatalogueEntry")
for m in messageCatalogueEntry:
print et.tostring(m[0], encoding='utf8')
Any help would be appreciated. I've been at this for a few work days now and I'm not any closer to finishing than I was when I first started!
Upvotes: 0
Views: 3286
Reputation: 77357
Actually, you are getting the MessageCatalogEntry's. The problem is in the print statement. An element acts like a list, so m[0]
is the first child of the MessageCatalogEntry. In
messageCatalogueEntry = x.findall("MessageCatalogueEntry")
for m in messageCatalogueEntry:
print et.tostring(m[0], encoding='utf8')
change the print to print et.tostring(m, encoding='utf8')
to see the right element.
I personally prefer lxml to elementtree. Assuming you want to associate entries by the 'key' attribute, you could use xpath to index one of the docs and then pull them into other doc.
import lxml.etree
tree_english = lxml.etree.parse('english.xml')
tree_japanese = lxml.etree.parse('japanese.xml')
# index the japanese catalog
j_index = {}
for catalog in tree_japanese.xpath('MessageCatalogue/*[@key]'):
j_index[catalog.get('key')] = catalog
# find catalog entries in english and merge the japanese
for catalog in tree_english.xpath('MessageCatalogue/*[@key]'):
j_catalog = j_index.get(catalog.get('key'))
if j_catalog is not None:
print 'found match'
for child in j_catalog:
print 'add one'
catalog.append(child)
print lxml.etree.tostring(tree_english, pretty_print=True, encoding='utf8')
Upvotes: 2