Cristian Lupascu

Reputation: 40506

Add new column to DataFrame inside a function

I have a dataframe (called train) that contains a YOB (year of birth) column. I'd like to compute the Age in a separate column, like so:

train$Age = 2016 - train$YOB

This works fine.

The problem is that I would also like to apply this operation (along with other preprocessing steps) to a number of other dataframes. So I was thinking of extracting the common parts into a function and passing the dataframes to be processed as arguments:

preprocess = function(d) {
  d$Age = 2016 - d$YOB
  # other transformations...
} 

After defining the function above, I expected that calling preprocess(train) would perform the aforementioned transformations on my dataframe. But it doesn't. For example, train$Age is NULL after the call.

Why doesn't the preprocess function transform the dataframe as expected? Is there a way to fix this?

Upvotes: 0

Views: 2642

Answers (2)

Sandeep S. Sandhu

Reputation: 427

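In R (as in most languages), when a function is called, the interpreter sets up a "scope" (in R, an environment) that determines which variables are available inside the function. The function's arguments are local names in that scope, so assigning to them does not change the caller's variables.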

Consider the variables a and b and the function "preprocess":

> a <- 2
> b <- 3
> preprocess <- function(a){a <- a + b; cat("value of a=", a, "\n")}
> preprocess(a)
value of a= 5 
> cat("value of a=", a, "\n")
value of a= 2

Here, the global variable "b" was visible inside the function, and the argument "a" did change within the scope of the function. But that "a" was local to the function: as soon as the function returned, its environment was discarded and the updated value was "lost".

The global variable "a", which was 2 before the call, was left unchanged.

However, if you return the value of "a" from the function and assign the result back to "a", the global value is updated, as in this example:

> a <- 2
> b <- 3
> preprocess <- function(a){a <- a + b; cat("value of a=", a, "\n"); return(a)}
> a <- preprocess(a)
value of a= 5 
> cat("value of a=", a, "\n")
value of a= 5

For more information, run ?environment in your R session.
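The same behaviour applies to data frames. Below is a minimal sketch, assuming a small made-up train data frame (only the YOB column name comes from the question; the values and the two function names are hypothetical). It shows that a function that modifies its argument without returning it leaves the caller's data frame untouched, while returning and reassigning updates it.

```r
# Hypothetical sample data; only the YOB column name is from the question.
train <- data.frame(YOB = c(1980, 1995))

# Modifies only the local copy "d"; nothing useful is returned.
preprocess_no_return <- function(d) {
  d$Age <- 2016 - d$YOB
  invisible(NULL)
}

preprocess_no_return(train)
is.null(train$Age)   # TRUE: the caller's data frame has no Age column

# Returns the modified copy so that the caller can reassign it.
preprocess_return <- function(d) {
  d$Age <- 2016 - d$YOB
  d
}

train <- preprocess_return(train)
train$Age            # 36 21
```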

Upvotes: 2

Bernhard

Reputation: 4417

You add the new column only inside the function, but functions normally do not change values outside of themselves. There is a quick and dirty way to do that via <<-, but you really should not use it, because your function would then change values outside of the function, and functions are not supposed to do that. It is very bad style. Values should enter functions as arguments and leave them as return values.
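To illustrate why <<- is a poor fit here, consider this rough sketch. The data frames and the function name preprocess_global are made up for the example. Because the global name has to be hard-coded inside the function, calling it on a different data frame silently overwrites train and leaves the data frame that was actually passed in unchanged.

```r
# Hypothetical data frames for illustration.
train <- data.frame(YOB = c(1980, 1995))
other <- data.frame(YOB = c(1970, 2000, 2010))

# Anti-pattern: writes to a fixed global name instead of returning a value.
preprocess_global <- function(d) {
  d$Age <- 2016 - d$YOB
  train <<- d   # always overwrites "train", whatever was passed in
}

preprocess_global(other)
nrow(train)          # 3: train now holds the rows of "other"
is.null(other$Age)   # TRUE: "other" itself was not modified
```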

So modify the dataframe inside your function and hand it back as the return value:

preprocess = function(d) {
  d$Age = 2016 - d$YOB
  return(d)
} 

test <- data.frame(YOB=2017:2020)

test <- preprocess(test)

print(test)
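Since the question mentions several dataframes, one possible way to reuse the function is to keep them in a named list and apply preprocess to each element with lapply. This is only a sketch: the list frames and its contents are invented for the example.

```r
preprocess <- function(d) {
  d$Age <- 2016 - d$YOB
  d   # return the modified copy
}

# Hypothetical data frames collected in a named list.
frames <- list(
  train = data.frame(YOB = c(1980, 1995)),
  valid = data.frame(YOB = c(1970, 2000))
)

# lapply calls preprocess on each element and keeps the list names.
frames <- lapply(frames, preprocess)

frames$train$Age   # 36 21
frames$valid$Age   # 46 16
```

Each data frame is still passed in as an argument and comes back as a return value, so nothing outside the function is modified as a side effect.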

Upvotes: 1
