downer

Reputation: 964

Java pattern for configuring models in a data mining system

tl;dr:

I'm looking for the best pattern for enumerating the variables of a particular object and the ranges of values those variables may take. I then want to configure objects from a particular setting of those variables.

long version:

I'm going through some old code, trying to clean up some ugly hacks I made in the past. I have a nice machine learning and data mining library. In this library are a variety of statistical models (and other components) that can learn many of their own parameters through mathematical optimization, given enough data (this is called training). However, there are other parameters (hyper-parameters) that are set prior to training as one of the inputs. A hyper-parameter may be "tuned" by choosing many valid settings, building a model for each, and picking the winner. Several hyper-parameters may be tuned by applying this process recursively.
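To make the tuning process concrete, here is a minimal sketch of recursive enumeration over several hyper-parameters. The `GridSearch` class, the map-based settings, and the `score` function (standing in for "train a model and validate it") are all assumptions for illustration, not code from my library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper: enumerates every combination of candidate values. */
final class GridSearch {

    /** Returns one map per combination; each recursion level fixes one hyper-parameter. */
    static List<Map<String, Double>> enumerate(Map<String, double[]> grid) {
        List<Map<String, Double>> out = new ArrayList<>();
        expand(new ArrayList<>(grid.keySet()), 0, grid, new LinkedHashMap<>(), out);
        return out;
    }

    private static void expand(List<String> names, int depth, Map<String, double[]> grid,
                               Map<String, Double> current, List<Map<String, Double>> out) {
        if (depth == names.size()) {
            out.add(new LinkedHashMap<>(current)); // copy, because current is reused
            return;
        }
        String name = names.get(depth);
        for (double value : grid.get(name)) {
            current.put(name, value);
            expand(names, depth + 1, grid, current, out);
        }
        current.remove(name);
    }

    /** Picks the setting with the lowest score (lower is assumed to be better). */
    static Map<String, Double> best(Map<String, double[]> grid,
                                    ToDoubleFunction<Map<String, Double>> score) {
        Map<String, Double> winner = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (Map<String, Double> setting : enumerate(grid)) {
            double s = score.applyAsDouble(setting);
            if (s < bestScore) {
                bestScore = s;
                winner = setting;
            }
        }
        return winner;
    }
}
```

The number of combinations grows multiplicatively with each added hyper-parameter, which is one reason the valid ranges need to be declared somewhere the tuner can read them.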

Problem:

It seems to me that the required components of an effective system for gracefully dealing with hyper-parameters (more generally, options) are:

  1. A static variable enumerating every option, whatever its type (enum, float, boolean, etc.), together with its range of valid values. It can also store the default value of each option.
  2. A constructor that takes a configuration and builds an object with that option setting.
  3. Nice to have: the ability to build a "configuration" from a .properties file, a GNU-style CLI, or YAML, for instance.
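Components 1 and 2 could look roughly like the sketch below. All names here (`DoubleOption`, `Configuration`, `RidgeRegression`, the `lambda` option and its range) are made up for illustration, and only numeric options are shown; enum and boolean options would need their own descriptor types.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical descriptor for one numeric option: name, default, valid range. */
final class DoubleOption {
    final String name;
    final double defaultValue;
    final double min;
    final double max;

    DoubleOption(String name, double defaultValue, double min, double max) {
        this.name = name;
        this.defaultValue = defaultValue;
        this.min = min;
        this.max = max;
    }

    boolean isValid(double value) {
        return value >= min && value <= max;
    }
}

/** Hypothetical configuration: values keyed by option name, validated on write. */
final class Configuration {
    private final Map<String, Double> values = new HashMap<>();

    Configuration set(DoubleOption option, double value) {
        if (!option.isValid(value)) {
            throw new IllegalArgumentException(option.name + " must be in ["
                    + option.min + ", " + option.max + "], got " + value);
        }
        values.put(option.name, value);
        return this;
    }

    /** Falls back to the option's declared default when nothing was set. */
    double get(DoubleOption option) {
        return values.getOrDefault(option.name, option.defaultValue);
    }
}

/** Hypothetical model: options live in static fields, the constructor reads a configuration. */
final class RidgeRegression {
    static final DoubleOption LAMBDA = new DoubleOption("lambda", 1.0, 0.0, 1e6);

    final double lambda;

    RidgeRegression(Configuration configuration) {
        this.lambda = configuration.get(LAMBDA);
    }
}
```

Usage would be something like `new RidgeRegression(new Configuration().set(RidgeRegression.LAMBDA, 0.5))`. Because each option carries its own range and default, a tuner or a UI can inspect the static fields without instantiating the model.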

Difficulties I'm encountering:

One of the primary difficulties here seems to be 1). Java doesn't have any real mechanism for static abstract variables, so there is no way to enforce that a class implementing the "Configurable" interface stores its own default configuration. Is there a good way to get around this?
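One workaround I can think of, sketched below under my own assumptions, is to make the requirement an instance method on the interface and have each class return a static constant from it. The method name `defaultConfiguration`, the `NaiveBayes` class, and its option names are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** The compiler enforces this method, which it cannot do for a static field. */
interface Configurable {
    Map<String, Object> defaultConfiguration();
}

/** Hypothetical implementation backing the instance method with a static constant. */
final class NaiveBayes implements Configurable {
    private static final Map<String, Object> DEFAULTS;

    static {
        Map<String, Object> defaults = new LinkedHashMap<>();
        defaults.put("smoothing", 1.0);
        defaults.put("useLogProbabilities", Boolean.TRUE);
        DEFAULTS = Collections.unmodifiableMap(defaults);
    }

    @Override
    public Map<String, Object> defaultConfiguration() {
        return DEFAULTS;
    }
}
```

The drawback is that an instance is needed before the defaults can be read, which is awkward when the defaults are what you want to use to build that instance.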

A parent's default configuration should also be inherited by its subclasses.
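For string-valued settings, `java.util.Properties` already supports this kind of layering: a `Properties` object constructed with another one as its defaults falls back to it in `getProperty`. A minimal sketch, with hypothetical class and option names:

```java
import java.util.Properties;

/** Hypothetical parent class holding defaults shared by all models. */
class BaseModel {
    static final Properties DEFAULTS = new Properties();

    static {
        DEFAULTS.setProperty("seed", "42");
        DEFAULTS.setProperty("maxIterations", "100");
    }
}

/** Hypothetical subclass: only additions and overrides are declared here. */
class KMeansModel extends BaseModel {
    // The parent's defaults become the fallback layer of this object.
    static final Properties DEFAULTS = new Properties(BaseModel.DEFAULTS);

    static {
        DEFAULTS.setProperty("k", "8");
        DEFAULTS.setProperty("maxIterations", "300"); // overrides the parent's value
    }
}
```

Note that only `getProperty` consults the fallback layer; the inherited `Hashtable.get` does not. Values are also plain strings, so typed options and range checks would still have to be layered on top.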

I can make a constructor that takes a configuration object, but extending this to also accept CLI, YAML, or .properties representations of that configuration is a bit trickier.
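One direction I have considered is to keep a single configuration type and write one small adapter per source, so that constructors never see the source format. The sketch below assumes a hypothetical `ConfigSources` helper and a `--name=value` argument convention of my own choosing; YAML is left out because it needs a third-party parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical adapters that turn different sources into one configuration type. */
final class ConfigSources {

    /** Parses .properties-formatted text on top of the given defaults. */
    static Properties fromPropertiesText(String text, Properties defaults) {
        Properties result = new Properties(defaults);
        try {
            result.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    /** Parses arguments of the assumed form --name=value on top of the given defaults. */
    static Properties fromArgs(String[] args, Properties defaults) {
        Properties result = new Properties(defaults);
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("--") || eq <= 2) {
                throw new IllegalArgumentException("expected --name=value, got: " + arg);
            }
            result.setProperty(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return result;
    }
}
```

With this split, the model constructor only ever takes the configuration object, and adding a new source means adding one more adapter method.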

I'd love any advice on tackling this problem that Stack Overflow can provide. I've been thinking about this for some time, and all I have at the moment is ugly hacks, not beautiful code.

Upvotes: 1

Views: 225

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77454

You might want to have a look at how the data mining framework ELKI solves this. Judging from their wiki page on parameterization, they've gone through a couple of iterations. The current version seems to use plain Java constructors, plus a public static nested class that handles the parameterization.

It can do a number of interesting things, such as returning an optimized implementation (e.g. when you use Lp-Norm with p=2, it will return the static instance of Euclidean distance). Plus, it won't throw an exception on the first parameterization error, but can report multiple errors in one configuration pass.
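A rough sketch of that general pattern follows. This is not ELKI's actual API: the class names, the `Parameterizer` methods, and the string-map input are all assumptions made for illustration. It shows a plain constructor, a static nested class that does the parsing, error collection instead of failing on the first problem, and a shared Euclidean instance returned for p=2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

interface Distance {
    double between(double[] a, double[] b);
}

final class EuclideanDistance implements Distance {
    static final EuclideanDistance INSTANCE = new EuclideanDistance();

    private EuclideanDistance() {}

    @Override
    public double between(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}

class LpNormDistance implements Distance {
    final double p;

    /** Plain constructor: takes an already validated value. */
    LpNormDistance(double p) {
        this.p = p;
    }

    @Override
    public double between(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.pow(Math.abs(a[i] - b[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    /** Hypothetical nested class that parses raw settings and collects every error. */
    public static class Parameterizer {
        private final List<String> errors = new ArrayList<>();
        private double p = Double.NaN;

        Parameterizer configure(Map<String, String> raw) {
            for (String key : raw.keySet()) {
                if (!key.equals("p")) {
                    errors.add("unknown parameter: " + key);
                }
            }
            String text = raw.get("p");
            if (text == null) {
                errors.add("missing parameter: p");
                return this;
            }
            try {
                double value = Double.parseDouble(text);
                if (value >= 1.0 && !Double.isInfinite(value)) {
                    p = value;
                } else {
                    errors.add("p must be a finite number >= 1, got " + text);
                }
            } catch (NumberFormatException e) {
                errors.add("p is not a number: " + text);
            }
            return this;
        }

        List<String> errors() {
            return errors;
        }

        /** Returns the shared Euclidean instance when p is exactly 2. */
        Distance make() {
            if (!errors.isEmpty()) {
                throw new IllegalStateException(String.join("; ", errors));
            }
            return p == 2.0 ? EuclideanDistance.INSTANCE : new LpNormDistance(p);
        }
    }
}
```

Because `make()` returns the `Distance` interface rather than the concrete class, the caller does not need to know that a specialized implementation was substituted.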

Their MiniGUI has content assist (e.g. a dropdown of implementations or enum values), tooltips, etc., and there is also a command line interface. It will also list valid parameter information, such as range constraints or available implementations.

I don't know if they also have a tool to automatically vary the parameters to find a local optimum. I think I at least saw some plans along these lines announced.

Upvotes: 2
