vicky
vicky

Reputation: 15

Using same function on multiple datasets and using specific columns

I have 8 datasets and I want to apply a function to convert any number less than 5 to NA on 3 columns(var1,var2,var3) of each dataset. How can I write a function to do it effectively and faster ? I went through lots of such questions on Stack overflow but I didnt find any answer where specific columns were used. I have written the function to replace but cant figure out how to apply to all the datasets.

Input:
Data1
variable1 variable2 variable3 variable4
10           36        56        99
15           3         2         56
4            24        1         1

Expected output:
variable1 variable2 variable3 variable4
10           36         56        99
15           NA         NA        56
NA           24        NA         1

Perform the same thing for 7 more datasets.

Till now I have stored the needed variables and datasets in two different list.

var1=enquo(variable1)
var2=enquo(variable2)
var3=enquo(variable3)
Total=3


listofdfs=list()
listofdfs_1=list()
for(i in 1:8) {
  df=sym((paste0("Data",i)))
listofdfs[[i]]=df
  }

for(e in 1:Ttoal) {    
listofdfs[[e]]= eval(sym(paste0("var",e)))
}

The selected columns will go through this function:

temp_1=function(x,h) {
  h=enquo(h)
  for(e in 1:Total) {    
  if(substr(eval(sym(paste0("var",e))),1,3)=="var") {
 y= x %>% mutate_at(vars(!!h), ~ replace(., which(.<=5),NA))
 return(y)
  }

}
}

I was expecting something :

lapply(for each dataset's selected columns,temp_1)

Upvotes: 0

Views: 653

Answers (2)

Gregor Thomas
Gregor Thomas

Reputation: 146164

Here's a simple approach that should work:

cols_to_edit = paste0("var", 1:3)
result_list = lapply(list_of_dfs, function(x) {
  x[cols_to_edit][x[cols_to_edit] < 5] = NA
  return(x)
})

I assume your starting data is in a list called list_of_dfs, that the names of columns to edit are the same in all data frames, and that you can construct a character vector cols_to_edit with those names.

Upvotes: 1

Rui Barradas
Rui Barradas

Reputation: 76663

Here is a solution to the problem in the question.
First of all, create a test data set.

createData <- function(Total = 3){
  numcols <- Total + 1
  set.seed(1234)
  for(i in 1:8){
    tmp <- replicate(numcols, sample(10, 20, TRUE))
    tmp <- as.data.frame(tmp)
    names(tmp) <- paste0("var", seq_len(numcols))
    assign(paste0("Data", i), tmp, envir = .GlobalEnv)
  }
}

createData()

Now, the data transformation.
This is much easier if the many dataframes are in a "list".

df_list <- mget(ls(pattern = "^Data"))

I will present solutions, a base R solution and a tidyverse one. Note that both solutions will use function temp_1, written in base R only.

library(tidyverse)

temp_1 <- function(x, h){
  f <- function(v){
    is.na(v) <- v <= 5
    v
  }
  x[h] <- lapply(x[h], f)
  x
}

h <- grep("var[123]", names(df_list[[1]]), value = TRUE)

df_list1 <- lapply(df_list, temp_1, h)
df_list2 <- df_list %>% map(temp_1, h)

identical(df_list1, df_list2)
#[1] TRUE

Upvotes: 0

Related Questions