Reputation: 870
For a homework assignment, I wrote a function that performs forward step-wise regression. It takes 3 arguments: dependent variable, list of potential independent variables, and the data frame in which these variables are found. Currently all of my inputs except data frame, including the list of independent variables, are strings.
Many built-in functions, as well as functions from high-profile packages, allow for variable inputs that are not strings. Which way is best-practice and why? If non-string is best practice, how can I implement this considering that one of the arguments is a list of variables in the data frame, not a single variable?
Upvotes: 3
Views: 312
Reputation: 15163
Personally I don't see any problem with using strings if it accomplishes what you need it to. If you want, you could rewrite your function to take a formula as input rather than strings to designate independent and dependent variables. In this case your function calls would look like this:
fitmodel(x ~ y + z,data)
rather than this:
fitmodel("x",list("y","z"),data)
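If you take the formula route, the first thing the function needs is the name of the dependent variable and the list of candidate terms. A minimal sketch of how that could be recovered from a formula is below; parse_model_formula is a hypothetical helper name, not something from base R, and it assumes a two-sided formula.

```r
# Hypothetical helper (not part of base R): split a two-sided formula into
# the response name and the candidate term labels.
parse_model_formula <- function(formula, data) {
  tt <- terms(formula, data = data)      # expands the right-hand side
  list(
    response   = deparse(formula[[2]]),  # left-hand side as a string
    candidates = attr(tt, "term.labels") # one label per right-hand-side term
  )
}

df <- data.frame(x = 1:10, y = 10:1, z = sqrt(1:10))
parts <- parse_model_formula(x ~ y + log(z), df)
parts$response    # "x"
parts$candidates  # "y" "log(z)"
```

This gives you back the same strings your current function already works with, so the rest of the algorithm would not need to change.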
Using formulas would allow you to specify simple algebraic combinations of variables to use in your regression, like x ~ y + log(z). If you go this route, then you can build the data frame specified by the formula by calling model.frame and then use this new data frame to run your algorithm. For example:
> df<-data.frame(x=1:10,y=10:1,z=sqrt(1:10))
> model.frame(x ~ y + z,df)
    x  y        z
1   1 10 1.000000
2   2  9 1.414214
3   3  8 1.732051
4   4  7 2.000000
5   5  6 2.236068
6   6  5 2.449490
7   7  4 2.645751
8   8  3 2.828427
9   9  2 3.000000
10 10  1 3.162278
> model.frame(x ~ y + z + I(x^2) + log(z) + I(x*y),df)
    x  y        z I(x^2)    log(z) I(x * y)
1   1 10 1.000000      1 0.0000000       10
2   2  9 1.414214      4 0.3465736       18
3   3  8 1.732051      9 0.5493061       24
4   4  7 2.000000     16 0.6931472       28
5   5  6 2.236068     25 0.8047190       30
6   6  5 2.449490     36 0.8958797       30
7   7  4 2.645751     49 0.9729551       28
8   8  3 2.828427     64 1.0397208       24
9   9  2 3.000000     81 1.0986123       18
10 10  1 3.162278    100 1.1512925       10
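To show how a formula interface might plug into the algorithm itself, here is a rough sketch of forward selection driven by a formula. It is not the asker's function: forward_select is a hypothetical name, and using lm with AIC as the selection criterion is an assumption, since the question does not say which criterion the assignment uses.

```r
# Sketch only: forward selection over the terms of a formula.
# `forward_select` is a hypothetical name; AIC is an assumed criterion.
forward_select <- function(formula, data) {
  response   <- deparse(formula[[2]])
  candidates <- attr(terms(formula, data = data), "term.labels")

  # Start from the intercept-only model.
  selected <- character(0)
  best_fit <- lm(reformulate("1", response = response), data = data)

  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break

    # Fit one model per remaining candidate added to the current set.
    fits <- lapply(remaining, function(term) {
      lm(reformulate(c(selected, term), response = response), data = data)
    })
    aics <- vapply(fits, AIC, numeric(1))

    # Stop when no candidate improves on the current model.
    if (min(aics) >= AIC(best_fit)) break
    best     <- which.min(aics)
    selected <- c(selected, remaining[best])
    best_fit <- fits[[best]]
  }
  best_fit
}

# Example call on a built-in data set:
fit <- forward_select(mpg ~ wt + hp + log(disp), mtcars)
formula(fit)
```

Because the candidate terms come from the formula's term labels, transformed terms such as log(disp) are handled the same way as plain column names.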
Upvotes: 4