Reputation: 329

Python: Parse dita map file and contents and output all href values

I'm a newbie Python programmer looking for a script or snippet to help. I have to parse a DITA map/XML file and, for every XML file, output its filename, open the file, search it for referenced .dita, .ditamap, or .xml files, output their filenames, and recurse into those files. The idea is to produce a list of all the files referenced by that .ditamap/XML file and its children. This list will be used to zip that group of files and send it for processing. I found some sample code, but I get no output!

import os
import glob
root_dir = '~/test_folder'
for filename in glob.glob(root_dir + '**/*.xml', recursive=True):
    print(filename)

Here is a sample ditamap file:

<?xml version="1.0" encoding="utf-8"?><?Inspire CreateDate="2019-04-04T16:06:14" ModifiedDate="2022-11-11T16:44:57"?><!DOCTYPE bookmap PUBLIC "-//OASIS//DTD DITA BookMap//EN" "bookmap.dtd">

<bookmap id="bookmap_e90eb827-7421-4491-8df3-5fea34a44931" xml:lang="en-US">
    <booktitle id="booktitle_a78ddf49-09d7-4d3d-925c-d42d9ff7f360">
        <mainbooktitle id="mainbooktitle_0a34f716-bedc-4c5d-b198-dfd5006a3174">About the Documentation</mainbooktitle>
    </booktitle>
    <bookmeta>
        <prodinfo>
            <prodname />
            <vrmlist>
                <vrm version="1" />
            </vrmlist>
            <!--Do not change: Must be Manual-->
            <brand>Manual</brand>
        </prodinfo>
        <!--sets task labels (1st othermeta tag below)-->
        <othermeta content="yes" name="task-labels" />
        <othermeta content="about" name="bundle" />
        <bookid>
            <!--Revision-->
            <volume>A0X</volume>
        </bookid>
        <bookrights>
            <copyrfirst>
                <!--Format of copyright year is yyyy - mm-->
                <year>2019 - 04</year>
            </copyrfirst>
            <bookowner>
                <!--Do not change organization-->
                <organization>Dell</organization>
            </bookowner>
        </bookrights>
    </bookmeta>
    <chapter href="subjectscheme_6b1f4589-e73e-49be-806d-0d064f3efd01.xml" format="ditamap" outputclass="subjectscheme" processing-role="resource-only" scope="external" />
    <chapter href="atm-About_user_guide_891d23dc-a186-422d-af40-75249dd31f87.xml">
        <topicmeta>
            <navtitle>About the <keyword conref="lib-Boomi_Keywords_0346af2b-13d7-491e-bec9-18c5d89225bf.xml#GUID-0207C7F1-40FD-4537-BE59-1D6DA46B9A1D/BOOMI_DELL" /><keyword conref="lib-Boomi_Keywords_0346af2b-13d7-491e-bec9-18c5d89225bf.xml#GUID-0207C7F1-40FD-4537-BE59-1D6DA46B9A1D/BOOMI_ATOMSPHERE" /> User Guide</navtitle>
        </topicmeta>
        <topicref href="atm-Content_browsing_2c16a734-5cf8-416c-8978-0062ac04e430.xml">
            <topicmeta>
                <navtitle>Content browsing</navtitle>
            </topicmeta>
        </topicref>
        <topicref href="atm-Content_searching_acdba241-6d33-41bc-8886-0907906fed64.xml">
            <topicmeta>
                <navtitle>Content searching</navtitle>
                <othermeta name="mini-toc" content="yes" />
            </topicmeta>
        </topicref>
        <topicref href="atm-Creating_a_documentation_account_c4ddf038-e007-4ee3-bef9-9f4eb06d0f89.dita" />
        <topicref href="atm-Collections_of_your_favorite_topics_5dd10ed2-b689-4628-bc2c-bc35dd4f571e.xml">
            <topicref id="topicref_bb2f9a40-0266-44b5-a061-39eca24b5d41" href="atm-sharing_saved_collections_d41e734f-4b2e-4c1e-82e7-91617d1008ae.dita" navtitle="atm-Sharing_saved_collections" type="task" />
        </topicref>
        <topicref id="topicref_8a2ba548-6595-4cc5-af12-afa2631abfbb" href="atm-Using_table_filters_178c0de0-ddee-4073-b828-476ad13345c4.dita" type="task" />
        <topicref href="atm-Team_welcomes_your_feedback_848e635e-0132-43d8-b22d-bbdf87ca398a.xml">
            <topicmeta>
                <navtitle>The <keyword conref="lib-Boomi_Keywords_0346af2b-13d7-491e-bec9-18c5d89225bf.xml#GUID-0207C7F1-40FD-4537-BE59-1D6DA46B9A1D/BOOMI_ATOMSPHERE">The T</keyword> documentation team welcomes your feedback</navtitle>
            </topicmeta>
        </topicref>
        <topicref href="atm-Other_ways_to_get_help_09adc783-784f-4f15-87f9-672d8030b689.xml">
            <topicmeta>
                <navtitle>Other ways to get help</navtitle>
            </topicmeta>
        </topicref>
    </chapter>
    <chapter>
        <topicref href="atm-Terms_of_use_78ffba54-261d-428d-afcd-a9db3ce51123.dita" />
    </chapter>
    <chapter>
        <topicref>
            <topicref href="atm-API_licensing_df074d66-3a10-4df5-8dd5-0a3e13373d0e.dita" />
        </topicref>
    </chapter>
    <backmatter>
        <topicref href="r-boo-Copyright_Boomi_Online_Help_9eea563b-53a2-4d69-b6e7-7372bf7d5440.xml" navtitle="Copyright">
            <topicmeta>
                <navtitle>CopyrightBoomiOnlineHelp</navtitle>
            </topicmeta>
        </topicref>
        <topicref href="atm-About_reltable_72640fe6-ae6d-490c-b369-7adbcb67bc99.xml" linking="normal" print="no" toc="no">
            <topicmeta>
                <navtitle>reltable</navtitle>
            </topicmeta>
        </topicref>
    </backmatter>
</bookmap>

If anyone can help or has a similar script that would traverse and parse the files, that would be great!

Any help is greatly appreciated!

Thanks,

Russ

Upvotes: 0

Answers (2)

Hermann12

Reputation: 3581

As we discussed, this code writes the file names and their links to a CSV file. Be careful: as I mentioned, the program has no stop condition. It will only stop, possibly with an error, when it reaches a file without links or a file that can't be found:

import csv
import xml.etree.ElementTree as ET

class Dita:
    """Append a file name and the list of files it references to a CSV file."""
    def __init__(self, file):
        self.file_name = file
        self.file_list = []

    def parse_dita(self, file_name):
        tree = ET.parse(file_name)
        root = tree.getroot()

        # collect the value of every href attribute in the document
        file_list = []
        for elem in root.iter():
            if 'href' in elem.attrib:
                file_list.append(elem.get('href'))

        # one CSV row per parsed file: its name, then its list of links
        with open("f_and_links.csv", 'a', newline='') as f_and_links:
            csv.writer(f_and_links).writerow([file_name, str(file_list)])

        return file_list

def main():
    root_file = "bookmap.dita"
    print("Source file:", root_file)
    dita_obj = Dita(root_file)
    file_list = dita_obj.parse_dita(root_file)

    for f in file_list:
        print("Links list", f)
        dita_obj.parse_dita(f)
        
if __name__ == '__main__':
    main()
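
The code above follows links only one level below the root file. A minimal sketch of a fully recursive version, assuming that every reference is a path relative to the file that contains it and that only .dita, .ditamap, and .xml targets should be followed. The names collect_references and FOLLOW_EXTENSIONS are made up for this example. A set of already visited paths acts as the stop condition, so files that link to each other cannot cause an endless loop, and missing files are skipped instead of raising an error:

```python
import os
import xml.etree.ElementTree as ET

# Assumption: only these extensions are followed (taken from the question).
FOLLOW_EXTENSIONS = {".dita", ".ditamap", ".xml"}


def collect_references(path, seen=None):
    """Return the set of existing files reachable from `path` via href/conref."""
    if seen is None:
        seen = set()
    path = os.path.normpath(path)
    # Stop condition: already visited, or the file does not exist on disk.
    if path in seen or not os.path.isfile(path):
        return seen
    seen.add(path)

    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return seen  # keep the file in the list, but do not look inside it

    base_dir = os.path.dirname(path)
    for elem in root.iter():
        for attr in ("href", "conref"):
            value = elem.get(attr)
            if not value:
                continue
            # "file.xml#topic/element" -> "file.xml"; "#local" -> ""
            target = value.split("#", 1)[0]
            if os.path.splitext(target)[1].lower() in FOLLOW_EXTENSIONS:
                collect_references(os.path.join(base_dir, target), seen)
    return seen
```

Calling collect_references("bookmap.dita") returns a set of paths, root file included, which can then be sorted and written one per line to the list file used for zipping.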

Upvotes: 0

Hermann12

Reputation: 3581

You can search for the href values like this:

import xml.etree.ElementTree as ET

tree = ET.parse("bookmap.dita")
root = tree.getroot()

for elem in root.iter():
    if 'href' in elem.attrib:
        # print tag name and file reference
        print(elem.tag, elem.get('href'))

Output:

chapter subjectscheme_6b1f4589-e73e-49be-806d-0d064f3efd01.xml
chapter atm-About_user_guide_891d23dc-a186-422d-af40-75249dd31f87.xml
topicref atm-Content_browsing_2c16a734-5cf8-416c-8978-0062ac04e430.xml
topicref atm-Content_searching_acdba241-6d33-41bc-8886-0907906fed64.xml
topicref atm-Creating_a_documentation_account_c4ddf038-e007-4ee3-bef9-9f4eb06d0f89.dita
topicref atm-Collections_of_your_favorite_topics_5dd10ed2-b689-4628-bc2c-bc35dd4f571e.xml
topicref atm-sharing_saved_collections_d41e734f-4b2e-4c1e-82e7-91617d1008ae.dita
topicref atm-Using_table_filters_178c0de0-ddee-4073-b828-476ad13345c4.dita
topicref atm-Team_welcomes_your_feedback_848e635e-0132-43d8-b22d-bbdf87ca398a.xml
topicref atm-Other_ways_to_get_help_09adc783-784f-4f15-87f9-672d8030b689.xml
topicref atm-Terms_of_use_78ffba54-261d-428d-afcd-a9db3ce51123.dita
topicref atm-API_licensing_df074d66-3a10-4df5-8dd5-0a3e13373d0e.dita
topicref r-boo-Copyright_Boomi_Online_Help_9eea563b-53a2-4d69-b6e7-7372bf7d5440.xml
topicref atm-About_reltable_72640fe6-ae6d-490c-b369-7adbcb67bc99.xml
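
The sample map also references files through conref attributes on keyword elements, and those values carry a "#..." fragment after the file name. A small sketch of how both attributes could be collected, assuming the part before "#" is the file name. The function referenced_files and the shortened SAMPLE document are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened, made-up sample in the shape of the bookmap from the question.
SAMPLE = """<bookmap>
  <chapter href="intro.xml">
    <topicmeta>
      <navtitle><keyword conref="lib-keywords.xml#GUID-1/NAME" /></navtitle>
    </topicmeta>
    <topicref href="task.dita" />
    <topicref href="intro.xml#section" />
  </chapter>
</bookmap>"""


def referenced_files(root):
    """Return referenced file names in document order, without duplicates."""
    names = []
    for elem in root.iter():
        for attr in ("href", "conref"):
            value = elem.get(attr)
            if value is None:
                continue
            name = value.split("#", 1)[0]  # drop the "#topic/element" part
            if name and name not in names:
                names.append(name)
    return names


print(referenced_files(ET.fromstring(SAMPLE)))
# prints ['intro.xml', 'lib-keywords.xml', 'task.dita']
```

For a file on disk, pass ET.parse(file_name).getroot() instead of the parsed sample string.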

Hope this helps you.

Upvotes: 0
