Reputation: 329
I'm a newbie Python programmer looking for a script or snippet to help. I have to parse a DITA map/XML file and, for every XML file, output its filename, open the file, search it for referenced .dita, .ditamap, or .xml files, output their filenames, and recurse into those files. The idea is to produce a list of all the files referenced by that .ditamap/.xml file and its children. This list will feed the zipping of that group of files to send for processing.
I found some sample code but I get no output!
import os
import glob

root_dir = '~/test_folder'
for filename in glob.glob(root_dir + '**/*.xml', recursive=True):
    print(filename)
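Two details in this snippet would likely explain the empty result: glob does not expand "~" to the home directory, and joining root_dir and the pattern by plain concatenation yields '~/test_folder**/*.xml', with no separator after the folder name. A minimal sketch of a corrected search follows; the function name find_xml_files is made up for illustration, and the folder is the one from the snippet.

```python
import glob
import os


def find_xml_files(root_dir):
    """Return every .xml file below root_dir, including subfolders."""
    # glob does not expand "~"; os.path.expanduser does.
    root_dir = os.path.expanduser(root_dir)
    # os.path.join supplies the separator that string concatenation left out.
    pattern = os.path.join(root_dir, "**", "*.xml")
    # With recursive=True, "**" matches zero or more directory levels.
    return sorted(glob.glob(pattern, recursive=True))


if __name__ == "__main__":
    for filename in find_xml_files("~/test_folder"):
        print(filename)
```

Note that this only lists files by location on disk. It does not tell you which files a given map actually references; that still requires parsing the XML.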
Here is a sample ditamap file:
<?xml version="1.0" encoding="utf-8"?><?Inspire CreateDate="2019-04-04T16:06:14" ModifiedDate="2022-11-11T16:44:57"?><!DOCTYPE bookmap PUBLIC "-//OASIS//DTD DITA BookMap//EN" "bookmap.dtd">
<bookmap id="bookmap_e90eb827-7421-4491-8df3-5fea34a44931" xml:lang="en-US">
<booktitle id="booktitle_a78ddf49-09d7-4d3d-925c-d42d9ff7f360">
<mainbooktitle id="mainbooktitle_0a34f716-bedc-4c5d-b198-dfd5006a3174">About the Documentation</mainbooktitle>
</booktitle>
<bookmeta>
<prodinfo>
<prodname />
<vrmlist>
<vrm version="1" />
</vrmlist>
<!--Do not change: Must be Manual-->
<brand>Manual</brand>
</prodinfo>
<!--sets task labels (1st othermeta tag below)-->
<othermeta content="yes" name="task-labels" />
<othermeta content="about" name="bundle" />
<bookid>
<!--Revision-->
<volume>A0X</volume>
</bookid>
<bookrights>
<copyrfirst>
<!--Format of copyright year is yyyy - mm-->
<year>2019 - 04</year>
</copyrfirst>
<bookowner>
<!--Do not change organization-->
<organization>Dell</organization>
</bookowner>
</bookrights>
</bookmeta>
<chapter href="subjectscheme_6b1f4589-e73e-49be-806d-0d064f3efd01.xml" format="ditamap" outputclass="subjectscheme" processing-role="resource-only" scope="external" />
<chapter href="atm-About_user_guide_891d23dc-a186-422d-af40-75249dd31f87.xml">
<topicmeta>
<navtitle>About the <keyword conref="lib-Boomi_Keywords_0346af2b-13d7-491e-bec9-18c5d89225bf.xml#GUID-0207C7F1-40FD-4537-BE59-1D6DA46B9A1D/BOOMI_DELL" /><keyword conref="lib-Boomi_Keywords_0346af2b-13d7-491e-bec9-18c5d89225bf.xml#GUID-0207C7F1-40FD-4537-BE59-1D6DA46B9A1D/BOOMI_ATOMSPHERE" /> User Guide</navtitle>
</topicmeta>
<topicref href="atm-Content_browsing_2c16a734-5cf8-416c-8978-0062ac04e430.xml">
<topicmeta>
<navtitle>Content browsing</navtitle>
</topicmeta>
</topicref>
<topicref href="atm-Content_searching_acdba241-6d33-41bc-8886-0907906fed64.xml">
<topicmeta>
<navtitle>Content searching</navtitle>
<othermeta name="mini-toc" content="yes" />
</topicmeta>
</topicref>
<topicref href="atm-Creating_a_documentation_account_c4ddf038-e007-4ee3-bef9-9f4eb06d0f89.dita" />
<topicref href="atm-Collections_of_your_favorite_topics_5dd10ed2-b689-4628-bc2c-bc35dd4f571e.xml">
<topicref id="topicref_bb2f9a40-0266-44b5-a061-39eca24b5d41" href="atm-sharing_saved_collections_d41e734f-4b2e-4c1e-82e7-91617d1008ae.dita" navtitle="atm-Sharing_saved_collections" type="task" />
</topicref>
<topicref id="topicref_8a2ba548-6595-4cc5-af12-afa2631abfbb" href="atm-Using_table_filters_178c0de0-ddee-4073-b828-476ad13345c4.dita" type="task" />
<topicref href="atm-Team_welcomes_your_feedback_848e635e-0132-43d8-b22d-bbdf87ca398a.xml">
<topicmeta>
<navtitle>The <keyword conref="lib-Boomi_Keywords_0346af2b-13d7-491e-bec9-18c5d89225bf.xml#GUID-0207C7F1-40FD-4537-BE59-1D6DA46B9A1D/BOOMI_ATOMSPHERE">The T</keyword> documentation team welcomes your feedback</navtitle>
</topicmeta>
</topicref>
<topicref href="atm-Other_ways_to_get_help_09adc783-784f-4f15-87f9-672d8030b689.xml">
<topicmeta>
<navtitle>Other ways to get help</navtitle>
</topicmeta>
</topicref>
</chapter>
<chapter>
<topicref href="atm-Terms_of_use_78ffba54-261d-428d-afcd-a9db3ce51123.dita" />
</chapter>
<chapter>
<topicref>
<topicref href="atm-API_licensing_df074d66-3a10-4df5-8dd5-0a3e13373d0e.dita" />
</topicref>
</chapter>
<backmatter>
<topicref href="r-boo-Copyright_Boomi_Online_Help_9eea563b-53a2-4d69-b6e7-7372bf7d5440.xml" navtitle="Copyright">
<topicmeta>
<navtitle>CopyrightBoomiOnlineHelp</navtitle>
</topicmeta>
</topicref>
<topicref href="atm-About_reltable_72640fe6-ae6d-490c-b369-7adbcb67bc99.xml" linking="normal" print="no" toc="no">
<topicmeta>
<navtitle>reltable</navtitle>
</topicmeta>
</topicref>
</backmatter>
</bookmap>
If anyone can help or has a similar script that would traverse and parse the files, that would be great!
Any help is greatly appreciated!
Thanks,
Russ
Upvotes: 0
Views: 794
Reputation: 3581
As we discussed, this code writes a CSV file listing each file and the links found in it. Be careful: as I told you, this program has no stop condition. As written, it only follows the links of the root file, one level deep, and it will stop with an error if a referenced file can't be found or parsed:
import csv
import xml.etree.ElementTree as ET


class Dita:
    """Write a CSV file with each file name and the list of files it references."""

    def __init__(self, file):
        self.file_name = file
        self.file_list = []

    def parse_dita(self, file_name):
        tree = ET.parse(file_name)
        root = tree.getroot()
        file_list = []
        for elem in root.iter():
            if 'href' in elem.attrib:
                file_list.append(elem.get('href'))
        # Append one row per parsed file: the file name, then one column per link
        with open("f_and_links.csv", 'a', newline='') as f_and_links:
            csv.writer(f_and_links).writerow([file_name, *file_list])
        return file_list


def main():
    root_file = "bookmap.dita"
    print("Source file:", root_file)
    dita_obj = Dita(root_file)
    file_list = dita_obj.parse_dita(root_file)
    # Only the links found in the root file are followed here
    for f in file_list:
        print("Links list", f)
        dita_obj.parse_dita(f)


if __name__ == '__main__':
    main()
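If you want a stop condition, one common approach is to remember which files have already been visited and skip them on a second encounter. The following is only a sketch under a few assumptions that are not confirmed by the question: href values are relative to the file that contains them, a "#..." suffix is an element id rather than part of the file name, and only .dita, .ditamap, and .xml targets should be followed. The function name collect_references is hypothetical.

```python
import os
import xml.etree.ElementTree as ET

# Extensions named in the question; other targets are ignored.
FOLLOWED_EXTENSIONS = (".dita", ".ditamap", ".xml")


def collect_references(root_file):
    """Return root_file plus every existing file reachable through href attributes."""
    seen = set()   # normalized paths already handled: this is the stop condition
    ordered = []   # existing files, in the order they were first met

    def visit(path):
        path = os.path.normpath(path)
        if path in seen:
            return  # already handled, so circular references cannot loop forever
        seen.add(path)
        if not os.path.isfile(path):
            print(f"skipped (not found): {path}")
            return
        ordered.append(path)
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError as exc:
            # Keep the file in the list, but do not look inside it.
            print(f"could not parse {path}: {exc}")
            return
        base_dir = os.path.dirname(path)
        for elem in root.iter():
            href = elem.get("href")
            if not href:
                continue
            target = href.split("#", 1)[0]  # drop any "#element-id" suffix
            if target.lower().endswith(FOLLOWED_EXTENSIONS):
                # Assumption: hrefs are relative to the referencing file.
                visit(os.path.join(base_dir, target))

    visit(root_file)
    return ordered


if __name__ == "__main__":
    for name in collect_references("bookmap.dita"):
        print(name)
```

The returned list could then be written to a text file or passed to the standard zipfile module to build the archive mentioned in the question.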
Upvotes: 0
Reputation: 3581
You can search for the href attribute like this:
import xml.etree.ElementTree as ET

tree = ET.parse("bookmap.dita")
root = tree.getroot()
for elem in root.iter():
    if 'href' in elem.attrib:
        # print tag name and file reference
        print(elem.tag, elem.get('href'))
Output:
chapter subjectscheme_6b1f4589-e73e-49be-806d-0d064f3efd01.xml
chapter atm-About_user_guide_891d23dc-a186-422d-af40-75249dd31f87.xml
topicref atm-Content_browsing_2c16a734-5cf8-416c-8978-0062ac04e430.xml
topicref atm-Content_searching_acdba241-6d33-41bc-8886-0907906fed64.xml
topicref atm-Creating_a_documentation_account_c4ddf038-e007-4ee3-bef9-9f4eb06d0f89.dita
topicref atm-Collections_of_your_favorite_topics_5dd10ed2-b689-4628-bc2c-bc35dd4f571e.xml
topicref atm-sharing_saved_collections_d41e734f-4b2e-4c1e-82e7-91617d1008ae.dita
topicref atm-Using_table_filters_178c0de0-ddee-4073-b828-476ad13345c4.dita
topicref atm-Team_welcomes_your_feedback_848e635e-0132-43d8-b22d-bbdf87ca398a.xml
topicref atm-Other_ways_to_get_help_09adc783-784f-4f15-87f9-672d8030b689.xml
topicref atm-Terms_of_use_78ffba54-261d-428d-afcd-a9db3ce51123.dita
topicref atm-API_licensing_df074d66-3a10-4df5-8dd5-0a3e13373d0e.dita
topicref r-boo-Copyright_Boomi_Online_Help_9eea563b-53a2-4d69-b6e7-7372bf7d5440.xml
topicref atm-About_reltable_72640fe6-ae6d-490c-b369-7adbcb67bc99.xml
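Note that the sample map also points at a keyword library through conref attributes (for example on the keyword elements), and those values carry a "#..." suffix after the file name. If those files should be in the list too, a sketch along these lines may help. It assumes that href and conref are the only attributes of interest and that everything after "#" is an element id; the function name referenced_files and the shortened file names in the demo are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: these two attributes are the ones that carry file references.
REFERENCE_ATTRIBUTES = ("href", "conref")


def referenced_files(root):
    """Return the distinct file names referenced below root, in document order."""
    found = []
    for elem in root.iter():
        for attr in REFERENCE_ATTRIBUTES:
            value = elem.get(attr)
            if not value:
                continue
            file_name = value.split("#", 1)[0]  # "lib.xml#id/KEY" -> "lib.xml"
            # An empty name means the reference points inside the same file.
            if file_name and file_name not in found:
                found.append(file_name)
    return found


if __name__ == "__main__":
    # Shortened, hypothetical stand-in for part of the sample map.
    sample = """<chapter href="guide.xml">
      <topicmeta><navtitle>About <keyword conref="lib.xml#id/KEY_A" />
        <keyword conref="lib.xml#id/KEY_B" /></navtitle></topicmeta>
      <topicref href="topic.dita" />
    </chapter>"""
    print(referenced_files(ET.fromstring(sample)))
    # prints ['guide.xml', 'lib.xml', 'topic.dita']
```

For the real map you would pass ET.parse("bookmap.dita").getroot() instead of the inline sample.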
Hope this helps you.
Upvotes: 0