Reputation: 12767
For the second time in two weeks, I'm working with data that includes a ton of empty columns. It's public records data, and I'm only interested in one category. I suspect that other categories of the larger data set use these columns, but the subset I care about doesn't. So I filter out the records I don't want, and then I'd like to systematically cull the empty columns.
This question has a great method:
R: Remove multiple empty columns of character variables
empty_columns <- sapply(df, function (k) all(is.na(k) | k == ""))
df <- df[!empty_columns]
But I'd like to make that a function, so I can run it using the name of the data frame exactly once. Something like:
drop_empty_cols <- function(df) {
empty_columns <- sapply(df, function (k) all(is.na(k) | k == ""))
df <- df[!empty_columns]
}
drop_empty_cols(my_frame)
But ... the method above fails, and fails silently. Here's some sample data:
demo <- read.table(text="Real.Val All.NA Nothin.here
1 3.5 NA tmp
2 3.0 NA tmp
3 3.2 NA tmp
4 3.1 NA tmp
5 3.6 NA tmp
6 3.9 NA tmp" , header = TRUE)
demo$Nothin.here <- ""
(I'm sure there's a way to write a reproducible example with an empty column, but mine was choking. So this empties it after you create the frame.)
If I do drop_empty_cols(demo), I still have 6 obs. of 3 variables. If I do
empty_columns <- sapply(demo, function (k) all(is.na(k) | k == ""))
demo <- demo[!empty_columns]
I get the desired result: 6 obs. of 1 variable. But to reuse that I have to replace demo three times. Is it even possible to use a function to transform a data frame directly?
Upvotes: 2
Views: 47
Reputation: 14370
I think your problem pretty much boils down to one of scope. In R, when you call a function, everything created in that function is local and not accessible outside it. So when you pass your demo data frame to the function, the function manipulates its own local copy (df), and the modified copy is not accessible outside the function. To get the result out of a function, people usually return a value and assign the result, such as:
add<- function(x,y) { return(x+y)}
res <- add(1,2)
> res
[1] 3
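To make the scoping point concrete, here is a minimal sketch. The helper blank_first_col and the data frame original are hypothetical names invented for this illustration; they are not part of the question.

```r
# Hypothetical helper, for illustration only: it modifies its argument,
# but the change only affects the function's local copy.
blank_first_col <- function(df) {
  df[[1]] <- NA    # changes the local copy named df
  invisible(NULL)  # nothing useful is returned to the caller
}

original <- data.frame(a = 1:3, b = c("x", "y", "z"))
blank_first_col(original)

# The caller's data frame still holds its original values.
print(original$a)
```

Because nothing is returned and assigned, the caller's original is left untouched, which is the same thing that happens to demo in the question.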
While this is the case in your specific example, you can, if you really want to, manipulate your demo object within your function call. You can do this by using the global assignment operator <<-; however, this is strongly discouraged.
Anyway, for the answer, I think there are two ways you can go about solving your problem: one good and one bad. The good way is to return your manipulated data frame at the end of your function, which you can then store. This is done by:
drop_empty_cols <- function(df) {
empty_columns <- sapply(df, function (k) all(is.na(k) | k == ""))
return(df[!empty_columns])
}
res<-drop_empty_cols(demo)
str(res)
'data.frame': 6 obs. of 1 variable:
$ Real.Val: num 3.5 3 3.2 3.1 3.6 3.9
Here we can see the output is 6 observations and 1 variable as expected.
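If the goal is to transform the data frame "in place", one common pattern is to assign the result back to the same name. The sketch below rebuilds the sample data with data.frame() rather than read.table(), which is an assumption made only to keep the example self-contained.

```r
# Sample data equivalent to the question's demo frame (rebuilt here with
# data.frame() so the example is self-contained).
demo <- data.frame(
  Real.Val = c(3.5, 3.0, 3.2, 3.1, 3.6, 3.9),
  All.NA = NA,
  Nothin.here = "",
  stringsAsFactors = FALSE
)

drop_empty_cols <- function(df) {
  # TRUE for columns where every value is NA or an empty string
  empty_columns <- sapply(df, function(k) all(is.na(k) | k == ""))
  df[!empty_columns]  # the last expression is the return value
}

# Assign the result back to the same name to replace the original.
demo <- drop_empty_cols(demo)
names(demo)
```

This still names the data frame twice, but the column logic lives in one place. As a side note, the version in the question also returns the trimmed frame, only invisibly, because its last expression is an assignment; the call appeared to do nothing because the result was never assigned.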
On the other hand you can use the global assignment operator (which I personally don't like because things can get confusing and you can overwrite results unknowingly). The code for this method is:
drop_empty_cols <- function(df) {
empty_columns <- sapply(df, function (k) all(is.na(k) | k == ""))
demo <<- (df[!empty_columns])
}
drop_empty_cols(demo)
str(demo)
'data.frame': 6 obs. of 1 variable:
$ Real.Val: num 3.5 3 3.2 3.1 3.6 3.9
This gives the same output as the above method. However, note that we don't actually store anything; we can simply call the function to manipulate the demo data. Furthermore, any call to this function will overwrite your demo data, whichever data frame you pass in, since the target name is hard-coded in demo <<- (df[!empty_columns]).
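The overwriting hazard can be seen in a small sketch. The names drop_empty_cols_global and other are hypothetical, chosen only for this illustration.

```r
# Hypothetical variant of the function with the target name hard-coded.
drop_empty_cols_global <- function(df) {
  empty_columns <- sapply(df, function(k) all(is.na(k) | k == ""))
  demo <<- df[!empty_columns]  # always writes to demo, whatever df is
}

demo <- data.frame(Real.Val = c(3.5, 3.0), All.NA = NA)
other <- data.frame(id = 1:2, blank = "", stringsAsFactors = FALSE)

# Intended to clean `other`, but the result lands in `demo` instead.
drop_empty_cols_global(other)

names(demo)   # demo now holds the trimmed copy of other
names(other)  # other itself is unchanged
```

After the call, demo no longer contains Real.Val, and other still has its empty column, which is why returning the value and assigning it explicitly is usually the safer choice.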
Upvotes: 2