Classification/Decision Trees and Choosing Splits

Question

This is a very basic example, but while doing some data analysis I keep finding myself writing very similar SQL count queries, like the ones below, to generate probability tables.

My tables are defined such that a value of 0 implies that an event did not take place while a value of 1 implies that the event did take place.

> sqldf("select count(distinct Date) from joinedData where C_O_Above_prevHigh = 0 and C_O_Below_prevLow = 0")
  count(distinct Date)
1                 1081

> sqldf("select count(distinct Date) from joinedData where C_O_Above_prevHigh = 0 and C_O_Below_prevLow = 0 and E_halfGap = 1")
  count(distinct Date)
1                  956

> sqldf("select count(distinct Date) from joinedData where C_O_Above_prevHigh = 1 OR C_O_Below_prevLow = 1 and E_halfGap = 1")
  count(distinct Date)
1                  504

In the above example, my predictor variables are C_O_Above_prevHigh and C_O_Below_prevLow, and my outcome variable is E_halfGap. In some cases there may be more predictor variables, e.g. Time.

Rather than manually entering all my queries with different permutations as above, is there anything available in R or some other application that will:

1) output the potential probability paths based on my predictors?
2) allow me to choose how to split the paths?

I appreciate your input.

Vincent Zoonekynd · Accepted Answer

If you want all totals and subtotals, you can use GROUP BY CUBE in SQL (but SQLite, which sqldf uses by default, does not support it) or addmargins in R.

addmargins( Titanic )
# More readable:
ftable( addmargins( Titanic ) )
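The same idea can be applied to data shaped like the question's table. The following is only a sketch: the joinedData frame here is simulated, with made-up probabilities, and it assumes one row per Date so that row counts match count(distinct Date). xtabs builds the counts for every predictor/outcome combination at once, and prop.table turns them into conditional probabilities of the outcome.

```r
# Hypothetical stand-in for joinedData (simulated values, not real data):
# one row per Date, with 0/1 indicator columns as described in the question.
set.seed(1)
n <- 200
joinedData <- data.frame(
  Date               = as.Date("2020-01-01") + seq_len(n) - 1,
  C_O_Above_prevHigh = rbinom(n, 1, 0.3),
  C_O_Below_prevLow  = rbinom(n, 1, 0.3),
  E_halfGap          = rbinom(n, 1, 0.6)
)

# Counts for every combination of the two predictors and the outcome.
counts <- xtabs(~ C_O_Above_prevHigh + C_O_Below_prevLow + E_halfGap,
                data = joinedData)

# Add totals and subtotals, then flatten the table for reading.
print(ftable(addmargins(counts)))

# Conditional probability of the outcome given each predictor combination:
# margin = c(1, 2) normalises each (Above, Below) cell over E_halfGap.
probs <- prop.table(counts, margin = c(1, 2))
print(ftable(round(probs, 3)))
```

Adding another predictor such as Time would only mean adding it to the formula, provided it is first binned into a small number of categories.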

If you want to build a decision tree, you can use the rpart package or check the machine learning or graphical models task views on CRAN.
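As a rough illustration of the rpart route, the sketch below fits a classification tree to simulated data. The column names are borrowed from the question, but the values and the relationship between predictors and outcome are invented for the example. The splitting criterion is chosen through parms, and rpart.control limits how far the tree is grown; the particular settings shown are arbitrary.

```r
library(rpart)

# Simulated data (assumption: the real joinedData has the same 0/1 columns).
set.seed(1)
n <- 500
above <- rbinom(n, 1, 0.3)
below <- rbinom(n, 1, 0.3)
# Invented relationship: the outcome is more likely when both predictors are 0.
p <- ifelse(above == 0 & below == 0, 0.9, 0.4)
joinedData <- data.frame(
  C_O_Above_prevHigh = factor(above),
  C_O_Below_prevLow  = factor(below),
  E_halfGap          = factor(rbinom(n, 1, p))
)

# method = "class" requests a classification tree.
# parms$split selects the criterion ("gini" or "information");
# rpart.control sets the minimum node size to split and the complexity penalty.
fit <- rpart(E_halfGap ~ C_O_Above_prevHigh + C_O_Below_prevLow,
             data    = joinedData,
             method  = "class",
             parms   = list(split = "gini"),
             control = rpart.control(minsplit = 20, cp = 0.01))

# Text view of the tree: each node lists its split, size and class proportions.
print(fit)

# Estimated class probabilities for each row, taken from its leaf.
leafProbs <- predict(fit, type = "prob")
```

rpart picks the splits itself according to the criterion, so the "choose how to split" part is controlled indirectly, through the criterion, the control settings and which predictors appear in the formula.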
