Aravinth Rajan

Reputation: 113

Issue parsing XML data with BeautifulSoup in Python: child tag <name> returns the parent's tag name

I have XML data from which I need to extract certain fields. However, I run into a problem when I try to extract the name field with BeautifulSoup.

  1. Issue 1: Instead of the content of the name field ("priority"), I get the name of its parent tag, attribute-item.
  2. Issue 2: I also need to extract the ID from the XML, i.e. the id attribute in <attribute-item id="mydata.core.customization.requirements._noSpwIUSEei1hLMz9D9OBw">

BeautifulSoup is our standard approach and I can't switch to another package, so a solution that uses it would be much appreciated.

Below is the XML data (the values to extract are the name fields and the id attributes):

<configurations>
   <attributes-configuration>
      <attributes>
         <attribute-item id="mydata.core.customization.requirements._noSpwIUSEei1hLMz9D9OBw">
            <name>priority</name>
            <description>priority of a requirement</description>
            <customization-element>mydata.core.customization.requirements</customization-element>
            <attribute-type>mydata.attribute_type.list</attribute-type>
            <options>
               <option>
                  <key>DEFAULT_LIST</key>
                  <value class="java.lang.String"> high,low,medium</value>
               </option>
               <option>
                  <key>LIST_TYPE</key>
                  <value class="java.lang.String">CUSTOM</value>
               </option>
            </options>
            <editable>true</editable>
            <userDefined>true</userDefined>
            <internal>false</internal>
         </attribute-item>
         <attribute-item id="mydata.core.customization.teststep.prerequisite">
            <name>Prerequisite</name>
            <description>User Defined Attribute</description>
            <customization-element>mydata.core.customization.teststep</customization-element>
            <attribute-type>mydata.attribute_type.string</attribute-type>
            <options>
               <option>
                  <key>DEFAULT_VALUE</key>
                  <value/>
               </option>
               <option>
                  <key>MAX_CHARACTERS</key>
                  <value class="java.lang.String">5000</value>
               </option>
            </options>
            <editable>true</editable>
            <userDefined>true</userDefined>
            <internal>false</internal>
         </attribute-item>
      </attributes>
   </attributes-configuration>
   <test-management/>
</configurations>

Below is my Python code:

import os
from bs4 import BeautifulSoup as bs

fileName = 'Configuration.xml'
fullFile = os.path.abspath(os.path.join('DataTransporter', fileName))
attributeList = []
with open(fullFile) as f:
    soup = bs(f, 'xml')

for attribData in soup.find_all('attribute-item'):
    dat = {
            'attribName' : attribData.name,
            'attribDesc' : attribData.description.text,
            'attribValue' : attribData.options.value.text,
          }
    attributeList.append(dat)
    #for attribParams in soup.find_all(name = 'value'):
    #newdict[attribName.text] = attribParams.text
print(attributeList)

My Output:

[{'attribName': 'attribute-item', 'attribDesc': 'priority of a requirement', 'attribValue': ' high,low,medium'}, {'attribName': 'attribute-item', 'attribDesc': 'User Defined Attribute', 'attribValue': ''}]

Expected output:

[{'attribName': 'priority', 'attribDesc': 'priority of a requirement', 'attribValue': ' high,low,medium'}, {'attribName': 'Prerequisite', 'attribDesc': 'User Defined Attribute', 'attribValue': ''}]

Upvotes: 0

Views: 50

Answers (1)

pazitos10

Reputation: 1709

At first I thought attribData.name.text would do it, but name is a reserved attribute on a BeautifulSoup Tag: it holds the tag's own name ('attribute-item' here), so attribData.name never reaches the <name> child element. To get the correct value you can use the findChildren('name') method instead:

attribData.findChildren('name')[0].text

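findChildren() returns a list, which in this case contains a single element, so it makes sense to use [0] to get that element and then .text to get the expected value. attribData.find('name').text is a shorter equivalent, since find() returns the first match directly.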
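To make the name collision concrete, here is a minimal sketch. It assumes the lxml package is installed (the 'xml' parser used in the question needs it), and the one-element XML string with id "a1" is a made-up stand-in, not the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item document, just enough to show the collision.
xml = '<attribute-item id="a1"><name>priority</name></attribute-item>'
item = BeautifulSoup(xml, "xml").find("attribute-item")

# Tag.name is the tag's own name, not the <name> child element.
print(item.name)

# Searching explicitly for a child called "name" avoids the collision.
print(item.find("name").text)

# find_all() (findChildren() is a legacy alias) returns a list instead.
print(item.find_all("name")[0].text)
```

The first print shows the tag name, while the other two show the text of the <name> child.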

To get the ID, use attribData['id']. In summary, your code would look like this (inside the for loop):

dat = {
    'attribName' : attribData.findChildren('name')[0].text,
    'id': attribData['id'],
    'attribDesc' : attribData.description.text,
    'attribValue' : attribData.options.value.text,
}

The output would look like this:

[{'attribName': 'priority', 'id': 'mydata.core.customization.requirements._noSpwIUSEei1hLMz9D9OBw', 'attribDesc': 'priority of a requirement', 'attribValue': ' high,low,medium'}, {'attribName': 'Prerequisite', 'id': 'mydata.core.customization.teststep.prerequisite', 'attribDesc': 'User Defined Attribute', 'attribValue': ''}]
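For reference, a self-contained sketch of the whole loop that can be run without Configuration.xml. It assumes lxml is installed for the 'xml' parser, embeds a trimmed copy of the XML from the question as a string (only the first option of each item is kept), and uses find('name') in place of findChildren('name')[0]. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's XML; the real data lives in Configuration.xml.
XML = """<configurations>
  <attributes-configuration>
    <attributes>
      <attribute-item id="mydata.core.customization.requirements._noSpwIUSEei1hLMz9D9OBw">
        <name>priority</name>
        <description>priority of a requirement</description>
        <options>
          <option>
            <key>DEFAULT_LIST</key>
            <value class="java.lang.String"> high,low,medium</value>
          </option>
        </options>
      </attribute-item>
      <attribute-item id="mydata.core.customization.teststep.prerequisite">
        <name>Prerequisite</name>
        <description>User Defined Attribute</description>
        <options>
          <option>
            <key>DEFAULT_VALUE</key>
            <value/>
          </option>
        </options>
      </attribute-item>
    </attributes>
  </attributes-configuration>
</configurations>"""

soup = BeautifulSoup(XML, "xml")

attribute_list = []
for item in soup.find_all("attribute-item"):
    attribute_list.append({
        "attribName": item.find("name").text,   # child <name>, not Tag.name
        "id": item["id"],                       # XML attribute lookup
        "attribDesc": item.description.text,
        "attribValue": item.options.value.text, # first <value> under <options>
    })

print(attribute_list)
```

Note that item.options.value only reaches the first <value> element; collecting every option would need item.find_all('option') instead.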

I hope it helps!

Upvotes: 1
