Giampaolo Levorato

Reputation: 1622

Parse xml file in pandas

I have this XML file, "LogReg.xml", which contains information about a logistic regression. I am interested in the names of the features and their coefficients; I explain in more detail below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
    <Header>
        <Application name="JPMML-SkLearn" version="1.6.35"/>
        <Timestamp>2022-02-15T09:44:54Z</Timestamp>
    </Header>
    <MiningBuildTask>
        <Extension name="repr">PMMLPipeline(steps=[('classifier', LogisticRegression())])</Extension>
    </MiningBuildTask>
    <DataDictionary>
        <DataField name="Target" optype="categorical" dataType="integer">
            <Value value="0"/>
            <Value value="1"/>
        </DataField>
        <DataField name="const" optype="continuous" dataType="double"/>
        <DataField name="grade" optype="continuous" dataType="double"/>
        <DataField name="emp_length" optype="continuous" dataType="double"/>
        <DataField name="dti" optype="continuous" dataType="double"/>
        <DataField name="Orig_FicoScore" optype="continuous" dataType="double"/>
        <DataField name="inq_last_6mths" optype="continuous" dataType="double"/>
        <DataField name="acc_open_past_24mths" optype="continuous" dataType="double"/>
        <DataField name="mort_acc" optype="continuous" dataType="double"/>
        <DataField name="mths_since_recent_bc" optype="continuous" dataType="double"/>
        <DataField name="num_rev_tl_bal_gt_0" optype="continuous" dataType="double"/>
        <DataField name="percent_bc_gt_75" optype="continuous" dataType="double"/>
    </DataDictionary>
    <RegressionModel functionName="classification" algorithmName="sklearn.linear_model._logistic.LogisticRegression" normalizationMethod="logit">
        <MiningSchema>
            <MiningField name="Target" usageType="target"/>
            <MiningField name="const"/>
            <MiningField name="grade"/>
            <MiningField name="emp_length"/>
            <MiningField name="dti"/>
            <MiningField name="Orig_FicoScore"/>
            <MiningField name="inq_last_6mths"/>
            <MiningField name="acc_open_past_24mths"/>
            <MiningField name="mort_acc"/>
            <MiningField name="mths_since_recent_bc"/>
            <MiningField name="num_rev_tl_bal_gt_0"/>
            <MiningField name="percent_bc_gt_75"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
            <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
        </Output>
        <RegressionTable intercept="0.8064694059338298" targetCategory="1">
            <NumericPredictor name="const" coefficient="0.8013433785974717"/>
            <NumericPredictor name="grade" coefficient="0.9010481046582982"/>
            <NumericPredictor name="emp_length" coefficient="0.9460686056314133"/>
            <NumericPredictor name="dti" coefficient="0.5117062988491518"/>
            <NumericPredictor name="Orig_FicoScore" coefficient="0.07944303372859234"/>
            <NumericPredictor name="inq_last_6mths" coefficient="0.20516234445402765"/>
            <NumericPredictor name="acc_open_past_24mths" coefficient="0.4852503249658917"/>
            <NumericPredictor name="mort_acc" coefficient="0.6673203078463711"/>
            <NumericPredictor name="mths_since_recent_bc" coefficient="0.1962158305958366"/>
            <NumericPredictor name="num_rev_tl_bal_gt_0" coefficient="0.12964661294856686"/>
            <NumericPredictor name="percent_bc_gt_75" coefficient="0.04534570018290847"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
    </RegressionModel>
</PMML>

I have parsed it using this code:

import pandas as pd
from lxml import objectify

path = 'LogReg.xml'

# objectify.parse accepts a file path directly
parsed = objectify.parse(path)
root = parsed.getroot()

data = []

# collect tag -> text for every child of each RegressionTable
for elt in root.RegressionModel.RegressionTable:
    el_data = {}
    for child in elt.getchildren():
        el_data[child.tag] = child.text
    data.append(el_data)

perf = pd.DataFrame(data)

I am interested in parsing this bit:

    <RegressionTable intercept="0.8064694059338298" targetCategory="1">
        <NumericPredictor name="const" coefficient="0.8013433785974717"/>
        <NumericPredictor name="grade" coefficient="0.9010481046582982"/>
        <NumericPredictor name="emp_length" coefficient="0.9460686056314133"/>
        <NumericPredictor name="dti" coefficient="0.5117062988491518"/>
        <NumericPredictor name="Orig_FicoScore" coefficient="0.07944303372859234"/>
        <NumericPredictor name="inq_last_6mths" coefficient="0.20516234445402765"/>
        <NumericPredictor name="acc_open_past_24mths" coefficient="0.4852503249658917"/>
        <NumericPredictor name="mort_acc" coefficient="0.6673203078463711"/>
        <NumericPredictor name="mths_since_recent_bc" coefficient="0.1962158305958366"/>
        <NumericPredictor name="num_rev_tl_bal_gt_0" coefficient="0.12964661294856686"/>
        <NumericPredictor name="percent_bc_gt_75" coefficient="0.04534570018290847"/>
    </RegressionTable>

so that I can build the following dictionary:

myDict = {
"const" : 0.8013433785974717,
"grade" : 0.9010481046582982,
"emp_length" : 0.9460686056314133,
"dti" : 0.5117062988491518,
"Orig_FicoScore" : 0.07944303372859234,
"inq_last_6mths" : 0.20516234445402765,
"acc_open_past_24mths" : 0.4852503249658917,
"mort_acc" : 0.6673203078463711,
"mths_since_recent_bc" : 0.1962158305958366,
"num_rev_tl_bal_gt_0" : 0.12964661294856686,
"percent_bc_gt_75" : 0.04534570018290847
}

Basically, in the dictionary the key is the name of the feature and the value is its logistic regression coefficient.

Please can anyone help me with the code?

Upvotes: 1

Views: 47

Answers (1)

Jack Fleeting

Reputation: 24940

I'm not sure you need pandas for this, but you do need to handle the namespaces in your XML.

Try something along these lines:

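# `root` is the root element already parsed in the question
myDict = {}
# map a prefix of our choosing to the document's default namespace
ns = {'xx': 'http://www.dmg.org/PMML-4_4'}

# you could collapse the next two into one line, but I believe it's clearer this way
rt = root.xpath('//xx:RegressionTable[.//xx:NumericPredictor]', namespaces=ns)[0]
nps = rt.xpath('./xx:NumericPredictor', namespaces=ns)

for pred in nps:
    # attribute values are strings, so convert the coefficient to float
    myDict[pred.attrib['name']] = float(pred.attrib['coefficient'])
myDict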

The output should be your expected output.
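
If you would rather not depend on lxml, the same namespace handling can be sketched with only the standard library. This is a minimal sketch, not a definitive implementation. The `coefficients` helper, the `pmml` prefix and the trimmed-down `PMML_SNIPPET` are names made up for illustration, and selecting the table by `targetCategory="1"` is an assumption based on the file shown in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for LogReg.xml (only two predictors kept), for illustration only
PMML_SNIPPET = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
    <RegressionModel functionName="classification" normalizationMethod="logit">
        <RegressionTable intercept="0.8064694059338298" targetCategory="1">
            <NumericPredictor name="const" coefficient="0.8013433785974717"/>
            <NumericPredictor name="grade" coefficient="0.9010481046582982"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
    </RegressionModel>
</PMML>
"""

# Prefix of our choosing, mapped to the document's default namespace
NS = {"pmml": "http://www.dmg.org/PMML-4_4"}


def coefficients(root, target_category="1"):
    """Hypothetical helper: map feature name -> coefficient (as float)."""
    # Pick the RegressionTable for the requested target category
    table = root.find(
        f".//pmml:RegressionTable[@targetCategory='{target_category}']", NS
    )
    if table is None:
        return {}
    # Names and coefficients are stored as attributes, not as element text
    return {
        pred.get("name"): float(pred.get("coefficient"))
        for pred in table.findall("pmml:NumericPredictor", NS)
    }


root = ET.fromstring(PMML_SNIPPET.strip())
coef = coefficients(root)
print(coef)
```

To run it against the real file, replace the `ET.fromstring(...)` line with `root = ET.parse("LogReg.xml").getroot()`.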

Upvotes: 1
