C4.5 algorithm with unbounded attributes

Question

The current implementation of C4.5 in VFDT (http://www.cs.washington.edu/dm/vfml/vfdt.html), and for that matter any other implementation, takes its input in the C4.5 file format when constructing the decision tree. According to this format, an attribute can be declared in one of the following ways:

continuous: The attribute has a continuous value.

discrete: The word 'discrete' followed by an integer which indicates how many values the attribute can take.

list of identifiers: A discrete attribute with its values enumerated (this is the preferred method for discrete attributes). The identifiers should be separated by commas.

ignore: The attribute should be ignored; it won't be used.

Does anybody know how to specify a discrete-valued attribute whose complete set of possible values is too large to enumerate?

For example, an "IP-Address" attribute can have Math.Pow(256,4) (about 4.3 billion) possible discrete values, and a "QueryString" attribute can have an infinite number of possible values.

Can the C4.5 algorithm handle an attribute that has, say, 100,000 distinct discrete values, or one whose exact number of values is not known and can only be approximated?

Thanks.

Fred Foo · Accepted Answer

The usual choice is to enumerate all the values of a discrete feature that occur in your training set. Since the algorithm can never gather enough statistics for values that are not seen during training, those values would be ignored no matter how you declared them.
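As a rough illustration of that idea, here is a minimal Python sketch that scans the training rows, collects the distinct values of one column, and formats them as an enumerated attribute declaration. The function names (`observed_values`, `names_line`) and the "browser" attribute are made up for the example, and the sketch assumes the values contain no characters that the C4.5 format treats specially (commas, colons, and so on); real data would need escaping or cleaning first.

```python
def observed_values(rows, column):
    """Return the distinct values seen in one column, in first-seen order."""
    seen = {}
    for row in rows:
        # dicts preserve insertion order, so this de-duplicates stably
        seen.setdefault(row[column], None)
    return list(seen)


def names_line(attribute, values):
    """Format one attribute as an enumerated discrete declaration.

    Assumption: values need no escaping for the C4.5 names format.
    """
    return f"{attribute}: {', '.join(values)}."


# Hypothetical training rows, used only for illustration.
training_rows = [
    {"browser": "firefox", "label": "ok"},
    {"browser": "chrome", "label": "ok"},
    {"browser": "firefox", "label": "attack"},
]

values = observed_values(training_rows, "browser")
print(names_line("browser", values))
```

A value that first shows up at prediction time is not covered by this declaration, so you would have to decide separately how to treat it, for instance as a missing value.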

Mind you, it's quite hard to gather statistics for such features anyway, so you might want to think about different representations. In particular, multi-word strings of text can be tokenized and treated as bags of words.
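To make the bag-of-words suggestion concrete, here is a small Python sketch that turns a query string into a set of binary "token present" features. Everything in it is an assumption for the sake of the example: the tokenization rule (lowercase runs of letters and digits), the `min_count` cutoff, and the `has_` feature names. It also records only presence, not token counts, which is a simplification of a full bag of words.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (an assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts, min_count=2):
    """Keep tokens that occur in at least min_count training texts."""
    counts = Counter()
    for text in texts:
        # set() so each text contributes at most once per token
        counts.update(set(tokenize(text)))
    return sorted(tok for tok, n in counts.items() if n >= min_count)


def to_features(text, vocabulary):
    """Map a text to one boolean feature per vocabulary token."""
    present = set(tokenize(text))
    return {f"has_{tok}": (tok in present) for tok in vocabulary}


# Hypothetical query strings, used only for illustration.
training_queries = [
    "q=cheap+flights&lang=en",
    "q=cheap+hotels&lang=en",
    "q=weather",
]

vocabulary = build_vocabulary(training_queries)
print(to_features("q=cheap+cars", vocabulary))
```

Each resulting feature takes only two values, so it can be declared as an ordinary enumerated discrete attribute, and the size of the vocabulary stays under your control through the cutoff.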
