scribbles

Reputation: 4339

Call data.frame columns inside of R functions?

What is the proper way to do this?

I have a function that works great on its own given a series of inputs and I'd like to use this function on a large dataset rather than singular values by looping through the data by row. I have tried to update the function to call data.frame columns rather than vector values, but have been unsuccessful.

A simple example of this is:

Let's say I have a data.frame with 4 columns: data$id, data$height, data$weight, and data$gender. I want to write a function that loops over each row (using apply) and calculates BMI (kg/m^2). I know this would be easy with dplyr, but I would like to learn how to do it without resorting to external packages, and I can't find a clear answer on how to properly reference the columns within the function.

Apologies in advance if this is a duplicate. I've been searching Stack Overflow pretty thoroughly in hopes of finding an existing example.

Upvotes: 3

Views: 11428

Answers (3)

deesolie

Reputation: 1062

Providing this answer as I was not able to find it on SO and banged my head against the wall trying to figure out why my function within my R package was assuming my new column was an object and not a data.frame column.

If a function takes in a data.frame and within the function you are adding and transforming the additional column(s), the way to do so is as follows:

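# value and i are passed in so the function runs as written
example_func <- function(df, value, i = 1) {
  # To add a new column
  df[["New.Column"]] <- value

  # To get the ith value of that column
  df[[i, "New.Column"]]

  # To subset the df using some conditional logic on that column
  # (the trailing comma selects rows; without it, columns are selected)
  df[df[["New.Column"]] == value, ]

  # To sort on that column, descending (setorderv() is from data.table)
  data.table::setorderv(df, "New.Column", -1)
  df
}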

Note that this requires library(data.table), which provides setorderv().
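For anyone who wants the same four operations without any package, here is a minimal base R sketch. The data and the column names (score, score.scaled) are made up for illustration, and order() stands in for setorderv().

```r
# Hypothetical example data; names are for illustration only.
df <- data.frame(id = 1:4, score = c(10, 40, 20, 40))

col <- "score.scaled"              # column name held in a variable

# Add a new column
df[[col]] <- df[["score"]] / 10

# Get the 3rd value of that column
third <- df[[3, col]]

# Subset rows using a condition on that column (note the trailing comma)
top <- df[df[[col]] == 4, ]

# Sort descending on that column
sorted <- df[order(df[[col]], decreasing = TRUE), ]
```

Because the column name is just a string, the same code works if col is a function argument.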

Upvotes: 0

bgoldst

Reputation: 35314

Speaking generally, functions should not know about more than they need to know about. If you write a function that requires a data.frame, when it is not essential that the input data be provided in a data.frame, then you are making your function more restrictive than it needs to be.

The correct way to write this function is as follows:

bmi <- function(height,weight) weight/height^2;

This will allow you to compute a vector of BMI values from a vector of height values and a vector of weight values, since both / and ^ are vectorized operations. So, for example, if you had two loose vectors of height and weight, then you could call it as follows:

set.seed(1);
N <- 5;
height <- rnorm(N,1.7,0.2);
weight <- rnorm(N,65,4);
BMI <- bmi(height,weight);
height; weight; BMI;
## [1] 1.574709 1.736729 1.532874 2.019056 1.765902
## [1] 61.71813 66.94972 67.95330 67.30313 63.77845
## [1] 24.88926 22.19652 28.91995 16.50967 20.45224

And if you had your inputs contained in a data.frame, you would be able to do this:

set.seed(2);
N <- 5;
df <- data.frame(id=1:N, height=rnorm(N,1.7,0.2), weight=rnorm(N,65,4), gender=sample(c('M','F'),N,replace=TRUE) );
df$BMI <- bmi(df$height,df$weight);
df;
##   id   height   weight gender      BMI
## 1  1 1.520617 65.52968      F 28.33990
## 2  2 1.736970 67.83182      M 22.48272
## 3  3 2.017569 64.04121      F 15.73268
## 4  4 1.473925 72.93790      M 33.57396
## 5  5 1.683950 64.44485      M 22.72637
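For comparison with the row-by-row approach the question asks about, here is a rough sketch that runs the same bmi function through apply() and checks that it matches the vectorized call. The people data.frame is hypothetical. Only the numeric columns are passed to apply(), because apply() converts its input to a matrix and a character column such as gender would turn every value into text.

```r
bmi <- function(height, weight) weight / height^2

# Hypothetical data, for illustration only.
people <- data.frame(id = 1:3,
                     height = c(1.60, 1.75, 1.90),
                     weight = c(55, 70, 90))

# Row by row: apply() hands each row to the function as a named numeric vector.
by_row <- apply(people[, c("height", "weight")], 1,
                function(row) bmi(row[["height"]], row[["weight"]]))

# Vectorized: a single call over whole columns.
vectorized <- bmi(people$height, people$weight)

all.equal(unname(by_row), vectorized)
```

Both give the same values; the vectorized call is shorter and avoids the matrix conversion.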

Upvotes: 0

Gregor Thomas

Reputation: 145755

I think this is what you're looking for. The easiest way to refer to columns of a data frame functionally is to use quoted column names. In principle, what you're doing is this:

data[, "weight"] / data[, "height"]^2

but inside a function you might want to let the user specify that the height or weight column is named differently, so you can write your function like this:

add_bmi = function(data, height_col = "height", weight_col = "weight") {
    data$bmi = data[, weight_col] / data[, height_col]^2
    return(data)
}

This function will assume that the columns to use are named "height" and "weight" by default, but the user can specify other names if necessary. You could do a similar solution using column indices instead, but using names tends to be easier to debug.
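As a usage sketch, here is the function called on a data frame whose columns are not named "height" and "weight". The patients data and its column names (ht_m, wt_kg) are hypothetical, and this version uses [[ ]] with weight / height^2 as in the expression at the start of this answer.

```r
add_bmi <- function(data, height_col = "height", weight_col = "weight") {
  # [[ ]] takes the column name as a string, so it can come from an argument
  data[["bmi"]] <- data[[weight_col]] / data[[height_col]]^2
  data
}

# Hypothetical data with non-default column names.
patients <- data.frame(ht_m = c(1.5, 2.0), wt_kg = c(45, 80))
patients <- add_bmi(patients, height_col = "ht_m", weight_col = "wt_kg")
patients$bmi
```

The function returns a modified copy, so the result has to be assigned back.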

Functions this simple are rarely useful. If you're calculating BMI for a lot of datasets maybe it is worth keeping this function around, but since it is a one-liner in base R you probably don't need it.

my_data$BMI = with(my_data, weight / height^2)

One note is that using column names stored in variables means you can't use $. This is the price we pay by making things more programmatic, and it's a good habit to form for such applications. See fortunes::fortune(343):

Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable consequences. It's best to acquire the '[[' and '[' habit early.

-- Peter Ehlers (about the use of $-extraction) R-help (March 2013)
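To make the point about $ concrete, here is a small sketch with a hypothetical data frame: when the column name is stored in a variable, $ looks for a column literally named after the variable, while [[ and [ use the variable's value.

```r
my_data <- data.frame(height = c(1.6, 1.8), weight = c(60, 80))
col <- "height"

via_dollar  <- my_data$col      # NULL: there is no column literally named "col"
via_double  <- my_data[[col]]   # the height column
via_single  <- my_data[, col]   # same values as [[col]]
```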

For fancier usage like dplyr's, where you don't have to quote column names (and can evaluate expressions), the lazyeval package makes things relatively painless and has very nice vignettes.

The base function with can be used to do some lazy evaluating, e.g.,

# sometimes with is nice
with(mtcars, plot(disp, mpg))
# the equivalent without with
plot(mtcars$disp, mtcars$mpg)

but with is best used interactively and in straightforward scripts. If you get into writing programmatic production code (e.g., your own R package), it's safer to avoid non-standard evaluation. See, for example, the warning in ?subset, another base R function that uses non-standard evaluation.
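As one possible standard-evaluation alternative to subset() inside a function, here is a minimal sketch. The helper name rows_above is made up for illustration; it takes the column name as a string and filters with [[ and [.

```r
# Hypothetical helper: keep rows where a named column exceeds a threshold.
rows_above <- function(data, col, threshold) {
  keep <- data[[col]] > threshold
  # Drop rows where the comparison is NA, as subset() would
  data[keep & !is.na(keep), , drop = FALSE]
}

heavy <- rows_above(mtcars, "wt", 5)
rownames(heavy)
```

Nothing here depends on how the caller's expressions are evaluated, so it behaves the same interactively and inside a package.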

Upvotes: 4
