Freestyle076
Freestyle076

Reputation: 1556

Infinite Recursion in Recursive Tree Builder

I've got an interesting problem with a recursive python function of mine. I must be missing some subtle base case or something! The objective of the code you'll see below is to build a decision tree (structure for determining data trends). It's being implemented by nesting python lists, that's what the return values and result list are doing. The structure itself isn't of that much importance (at least it seems so).

 def tdidt(dataset,attr_list,class_index,class_labels):
    print attr_list
    
    #base case 1: all the same class
    if all_same_class(dataset, class_index):
        class_label = majority_vote(dataset, class_index)
        return ['c',class_label]
    #base case 2: no more attributes
    elif attr_list == []:
        print 'i see and empty list!!!'
        class_label = majority_vote(dataset, class_index)
        return ['c',class_label]
    else:
        #else select attribute and partition dataset 
        attr = select_attribute(dataset, list(attr_list), class_index, class_labels)
        partitions = partition_on_attr(dataset, attr)
    
        #base case 3: one of the partitions is empty
        if has_empty_partition(partitions):
            class_label = majority_vote(dataset, class_index)
            return ['c',class_label]
        else:
            #enable respective data attribute labels for easier to read tree
            #result = result + ['a',titanic_header[attr]]
            #result = result + ['a',auto_header[attr]]
    
            #add attribute node to tree
            result = ['a',attr]
            #remove used attribute
            removed_attr_list = [x for x in list(attr_list) if x != attr]
    
            for partition in partitions.values():
                #add value node for tree, recursive call for next attributes
                result.append(["v",partition[0][attr],tdidt(partition, list(removed_attr_list), class_index, class_labels)]) 
        return result

attr_list (or something to do with attr_list) seems to be the the most likely suspect (see my complaints and the error messages below). attr_list contains a list of attribute indices, which I will "use" one by one by selecting one at a time from the list without replacement.

The three base cases for the recursive function are as follows:

1 (all the same class):

every row (list of attributes) in dataset (list of rows) shares a common value in the row indexed by class_index

2 (no more attributes):

the parameter attr_list is empty. This should be the most active base case, occurring most frequently.

3 (a partition is empty):

the data is partitioned by partition_on_attr(), one of these partitions (list of rows) is empty, cannot continue

recursive call is the longest line (the one causing the horizontal scroll bar) where the function tdidt() is called as a component of the appended list. The aim of the list comprehension assigned to removed_attr_list is to have an attribute list without the attribute we had just selected form attr_list.

ok, so as for the error I'm getting it's huge. I'm hitting the recursive limit, and for a really weird reason. I'm printing the attr_list immediately on function call and whenever an empty list is presented. Here's the output...

[0, 1, 2]
[0, 1]
[1]
[]
i see and empty list!!!
[] 
i see and empty list!!!
[1]
[]
i see and empty list!!!
[]
[1]
[1]
[1]
[1] 
[1]
[1]
.
.
.

Emphasis on the ellipsis. The [1]'s keep coming until recursion max out error.

RuntimeError: maximum recursion depth exceeded in cmp

The beginning behavior is exactly how I anticipate; the attr_list is used up, a few branches created from the for loop over the recursive call are seen, then the base case speaks up with "i see an empty list!", then the [1]'s start flowing!

Am I missing a recursive case? How does the attr_list get set back to [1] again and permanently? Any help is much appreciated, and thanks for making it to the bottom of this novel!

-----------------------------UPDATE 10/24/2014-----------------------------

When printing the dataset variable on each recursive call I see a similar behavior to the attr_list parameter. Things behave as expected and then the dataset parameter becomes this line over and over...

[['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'yes'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'yes'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'yes'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no'], ['third', 'adult', 'male', 'no']]

That's only one line! (And a peek at a weird mangling of some of the data)

How in the world is this happening? The function seems to spawn some kind of divide-by-zero meltdown once it is done with its process... my recursion must have something faulty.

-----------------------------UPDATE 10/25/2014-----------------------------

Ok, we're onto something now. @njlarsson suggested checking the behavior when the function partition_on_attr() returns the same dataset that it receives, meaning the chosen attribute is consistent through the dataset and no partition takes place. My original thoughts on this case is that the next recursive call would have a smaller attr_list, and eventually the empty attr_list base case would catch this partition. I am obviously mistaken, here are a couple of basic test cases I've run and the light that they share...

first play dataset has two instances, the first three attributes are the attributes to split on, fourth is the attribute we are trying to guess. Since all of the splitting attributes are unequal the case that the partition_on_attr() function returns the entire dataset as one partition cannot happen.

play = [[0,0,0,1],[1,1,1,0]]

output is a wonderful decision tree (not too aesthetically pleasing, but that's not the point)

['a', 0, ['v', 0, ['c', 1]], ['v', 1, ['c', 0]]]

now let's try a different play dataset, similar to the first play set - two instances, same schema.

play = [[1,1,1,1],[1,1,1,0]]

In this dataset the first three attributes, the attributes to be split on, are all equal. Thus the partition_by_attr() function HAS to return a single partition identical to the passed dataset. THE MAX RECURSION ERROR OCCURS!

@njlarsson you're onto something! I just don't see why this happens. What should the algorithm do in the case that the partition_on_attr function returns the entire dataset as one partition?

Upvotes: 0

Views: 1153

Answers (2)

Freestyle076
Freestyle076

Reputation: 1556

Found the problem! Unfortunately its rooted in the nature of decision trees and datasets rather than recursion. BUT, there was an edge recursive case being missed thanks to partition_on_attr(). As @njlarsson pointed out sometimes this function returns only one partition identical to the handed dataset in the case that the splitting attribute is consistent throughout the dataset.

The problem was that the partition_on_attr() function was unaware of all of the possible attribute values of the splitting attribute. So in my play dataset

play = [[1,1,1,1],[1,1,1,0]]

the partition function is unaware that either of the first three attributes could be 0 or 1. Thus it only returns one partition identical to the dataset when it should return two partitions, one holding the entire dataset (all instances with attribute values == 1) and an empty partition that holds any instances with attribute value == 0 had there been any.

In python dictionary notation (my method of creating partitions) this is what was being returned in the incorrect partition function:

{ 1: [[1,1,1,1], [1,1,1,0]] }

what is being returned by the correct partition function:

{ 0: [], 1: [[1,1,1,1], [1,1,1,0]] }

The empty partition for attribute value 0 is soooo important. Now base case #3 kicks in (partitions contains an empty partition) and the infinite recursion is halted.

We've done it! Thanks @njlarsson, you da man

Upvotes: 0

njlarsson
njlarsson

Reputation: 2340

I don't see anything in this code that produces infinite recursion on its own, I think the problem is elsewhere. Most likely, partition_on_attr produces a partition that is not strictly smaller than its dataset parameter. For instance, perhaps it produces a single partition equal to the parameter.

The first thing I'd check is what partition_on_attr does when the dataset has only one element. It could be that it returns a single partition containing that element, which would produce something like your output.

Upvotes: 1

Related Questions