How to import a predefined decision tree and use it for classification

Question

Long story short

As input I have a file containing a text representation of a simple decision tree:

Region in [ "someregion" ]
    Revenue <= 1020.30
        group in [ "audio" ] => 123.456
        group in [ "disc" ] => 123.456
            volume <= 1 => 734.25
...

The program should import it as a classifier and be able to predict an object's value. In other words, for an object like the following:

{"Region": "someregion", "Revenue": 100, "group": "disc", "volume": 0.5}

the prediction will be 734.25.
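
To make the expected behaviour concrete, here is a minimal sketch in plain Python. It assumes that each node holds one condition, an optional target and a list of children, and that the prediction is the target of the deepest node reached by following matching conditions. The names `Node`, `matches` and `predict` are hypothetical and do not come from any library.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    # Hypothetical node layout: one condition, an optional target, children.
    field_name: str
    op: str                          # "in" or "<=" (the two operators in the file)
    value: Any
    target: Optional[float] = None
    children: list = field(default_factory=list)

    def matches(self, obj: dict) -> bool:
        actual = obj.get(self.field_name)
        if actual is None:
            return False             # a missing field never matches
        if self.op == "in":
            return actual in self.value
        if self.op == "<=":
            return actual <= self.value
        raise ValueError(f"unsupported operator: {self.op}")


def predict(roots, obj):
    """Follow the first matching node at each level; return the last target seen."""
    result = None
    candidates = roots
    while candidates:
        node = next((n for n in candidates if n.matches(obj)), None)
        if node is None:
            break
        if node.target is not None:
            result = node.target
        candidates = node.children
    return result


# The tree from the text file above, written out by hand.
tree = [
    Node("Region", "in", ["someregion"], children=[
        Node("Revenue", "<=", 1020.30, children=[
            Node("group", "in", ["audio"], target=123.456),
            Node("group", "in", ["disc"], target=123.456, children=[
                Node("volume", "<=", 1, target=734.25),
            ]),
        ]),
    ]),
]

obj = {"Region": "someregion", "Revenue": 100, "group": "disc", "volume": 0.5}
prediction = predict(tree, obj)
print(prediction)
```

The traversal is greedy: it takes the first matching sibling and never backtracks, which is enough when sibling conditions are mutually exclusive.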

Which existing decision tree implementations can I use to create such a classifier? The scikit-learn trees come close, but I didn't find a way to build a custom predefined tree instead of fitting one on a dataset.

My attempt

For now I have implemented a simple tree parser:

import re

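def parse_condition(row):
    # try the leaf regex first (a condition followed by " => <target>")
    condition = re.search(
        r'^(?P<field>.*?) (?P<statement>.*?) (?P<value>.*?)(?: => )(?P<target>\d*\.\d*)',
        row
    ) or re.search(
        r'^(?P<field>.*?) (?P<statement>.*?) (?P<value>.*?)',
        row)
    return condition.groupdict()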

with open('tree.txt', 'r') as f:
    for row in f:
        # the node level is the number of leading tab characters
        level = len(re.search(r'^(\t*)', row).group(0))
        row = row.strip()
        condition = parse_condition(row)
        el = (level, condition)
        print(el)

which extracts the node level, condition and target value:

(0, {'field': 'Region', 'statement': 'in', 'value': ''})
(1, {'field': 'Revenue', 'statement': '<=', 'value': ''})
(2, {'field': 'group', 'statement': 'in', 'value': '[ "audio" ]', 'target': '123.456'})
(2, {'field': 'group', 'statement': 'in', 'value': '[ "disc" ]', 'target': '123.456'})
(3, {'field': 'volume', 'statement': '<=', 'value': '1', 'target': '734.25'})
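
As a possible next step, the flat `(level, condition)` pairs can be nested into a tree with a stack that remembers the most recent node at each level. This is only a sketch: `build_tree` and the `children` key are hypothetical names, and it assumes the level never increases by more than one from one row to the next.

```python
def build_tree(rows):
    """Nest (level, condition) pairs into a list of root nodes."""
    roots = []
    stack = []  # stack[i] is the most recent node seen at level i
    for level, condition in rows:
        node = dict(condition, children=[])
        del stack[level:]            # forget nodes at this level or deeper
        if stack:
            stack[-1]["children"].append(node)  # attach to the parent
        else:
            roots.append(node)
        stack.append(node)
    return roots


# The pairs printed by the parser above.
rows = [
    (0, {'field': 'Region', 'statement': 'in', 'value': ''}),
    (1, {'field': 'Revenue', 'statement': '<=', 'value': ''}),
    (2, {'field': 'group', 'statement': 'in', 'value': '[ "audio" ]', 'target': '123.456'}),
    (2, {'field': 'group', 'statement': 'in', 'value': '[ "disc" ]', 'target': '123.456'}),
    (3, {'field': 'volume', 'statement': '<=', 'value': '1', 'target': '734.25'}),
]

roots = build_tree(rows)
groups = roots[0]['children'][0]['children']
print([g['value'] for g in groups])
```

Each node is a plain dict, so the result can be inspected or serialised directly.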

Although I could develop a custom decision tree and condition parser from scratch, that seems like an attempt to reinvent the wheel.

dani herrera · Accepted Answer

There is a format named PMML (Predictive Model Markup Language). You can store decision trees in this format to avoid reinventing the wheel.

For example, the KNIME software can work with this format; see its "Example for Learning a Decision Tree". A PMML decision tree looks like this example:

Graphically, it looks like this in KNIME:

Then, the easy way to get results from a PMML file is a tree traversal. I posted a utility that does this in my GitHub repo lightpmmlpredictor. The core is a simple while loop that traverses the nodes using etree from lxml:

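from lxml import etree

# `Node` is expected to start at the root <Node> of the PMML tree and
# `values` to map field names to the values of the object being scored;
# both are defined elsewhere. Written for Python 3 (str instead of unicode).
while True:
    try:
        # first child <Node> whose predicate (its first element) matches
        fill = next(e for e in Node
                    if etree.QName(e).localname == 'Node' and
                       str(values[e[0].get('field')]) == e[0].get('value'))

        try:
            Node = fill
            predict = Node.get("score")
            n_tot = Node.get("recordCount")
            n_predict = max(x.get('recordCount')
                            for x in Node
                            if etree.QName(x).localname == 'ScoreDistribution'
                               and x.get('value') == predict)
        except (IndexError, ValueError):
            # no usable ScoreDistribution for the predicted score
            break

        try:
            pct = float(n_predict) / float(n_tot)
        except (TypeError, ValueError, ZeroDivisionError):
            pct = 0.5  # fall back when the counts are missing or unusable
    except StopIteration:
        # no child matches: the last `predict` is the result
        break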

Feel free to contribute to or fork my repo.
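
For readers without lxml, the same traversal idea can be sketched with the standard library alone. The PMML fragment below is a hypothetical, heavily trimmed tree written for illustration (it is not the KNIME output), and the helpers `local`, `matches` and `score` are hypothetical names. Only the `True` predicate and `SimplePredicate` with the `equal` and `lessOrEqual` operators are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed tree; a real PMML file also has a header,
# a data dictionary and an XML namespace.
PMML = """
<TreeModel>
  <Node score="123.456">
    <True/>
    <Node score="123.456">
      <SimplePredicate field="group" operator="equal" value="audio"/>
    </Node>
    <Node score="123.456">
      <SimplePredicate field="group" operator="equal" value="disc"/>
      <Node score="734.25">
        <SimplePredicate field="volume" operator="lessOrEqual" value="1"/>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""


def local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def matches(node, values):
    """Evaluate the node's predicate, taken to be its first child element."""
    if len(node) == 0:
        return False
    pred = node[0]
    kind = local(pred.tag)
    if kind == "True":
        return True
    if kind != "SimplePredicate":
        return False                 # other predicate types are not covered
    actual = values.get(pred.get("field"))
    if actual is None:
        return False
    expected = pred.get("value")
    op = pred.get("operator")
    if op == "equal":
        return str(actual) == expected
    if op == "lessOrEqual":
        return float(actual) <= float(expected)
    return False                     # other operators are not covered


def score(tree_model, values):
    """Descend through matching child nodes; return the deepest node's score."""
    node = next(c for c in tree_model if local(c.tag) == "Node")
    if not matches(node, values):
        return None
    while True:
        child = next((c for c in node
                      if local(c.tag) == "Node" and matches(c, values)), None)
        if child is None:
            return node.get("score")
        node = child


root = ET.fromstring(PMML)
result = score(root, {"group": "disc", "volume": 0.5})
print(result)
```

The score comes back as the attribute string from the XML, so convert it with `float()` if a number is needed.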
