Jack
Jack

Reputation: 165

How to recode dataset based on the values?

I have a big dataset with its format being similar to the followings:

names <- c('s1','s2','s3', 's4', 's5','s6', 's7', 's8','s9')
metals <- c(4.2, 5.3, 5.4,6, 7,8.5,0, 10.1,11)
plastics <- c(5.1, 0, 2.4,6.1, 7.7,5.5,1.99, 0 ,2.5)
grade<- c("AA", "AB", "AB", "AB", "AC" , "AB", NA , NA, NA)
my_df <- data.frame(names, metals, plastics, grade )

I need to recode each column For numeric columns I need to assign 1 where the value is greater than 0 and for the "grade" columns lets assume I want AA=1, AB=2, AC=3. what is the most efficient way to do so?

Thanks!

Upvotes: 0

Views: 231

Answers (4)

thelatemail
thelatemail

Reputation: 93908

As always in R, there are a million ways to do even the simplest task. Here's 2 more:

numvars <- sapply(my_df, is.numeric)
my_df[numvars] <- lapply(my_df[numvars], findInterval, 1)


my_df$grade <- c(2,1,3)[match(my_df$grade, c("AB","AA","AC"))]
               #newvals                    #oldvals

#  names metals plastics grade
#1    s1      1        1     1
#2    s2      1        0     2
#3    s3      1        1     2
#4    s4      1        1     2
#5    s5      1        1     3
#6    s6      1        1     2
#7    s7      0        1    NA
#8    s8      1        0    NA
#9    s9      1        1    NA

Upvotes: 2

Silence Dogood
Silence Dogood

Reputation: 3597

Using apply for numeric columns and match for character column

Edited as per@thelatemail's comments to avoid intermediate matrix coercion

my_df[,sapply(my_df,is.numeric)] = lapply(my_df[,sapply(my_df,is.numeric)],function(x) ifelse(x>0,1,0))

my_df$grade = match(my_df$grade,c("AA","AB","AC"))

my_df
#  names metals plastics grade
#1    s1      1        1     1
#2    s2      1        0     2
#3    s3      1        1     2
#4    s4      1        1     2
#5    s5      1        1     3
#6    s6      1        1     2
#7    s7      0        1    NA
#8    s8      1        0    NA
#9    s9      1        1    NA

There will be other solutions using data.table,dplyr soon. You could use microbenchmark to choose the best solution

Upvotes: 1

acylam
acylam

Reputation: 18681

Going off of @MFR's answer, here are two ways to do it:

NumColsToReplace = c("metals", "plastics")
my_df[NumColsToReplace] = ifelse(my_df[NumColsToReplace] > 0, 1, 0)

This allows you to pre-specify the columns you want to replace without copying the second line a bunch of times.

There is also another more efficient way using lapply and replace:

my_df[NumColsToReplace] = lapply(my_df[NumColsToReplace], 
                                 function(x) replace(x, x>0, 1))

This may be more typing, but it is two times as fast (or more) as the first method. Below are some benchmarking:

Unit: microseconds
                                                                   expr      min
 lapply(my_df[NumColsToReplace], function(x) replace(x, x > 0,      1))     23.949
                                  ifelse(my_df[NumColsToReplace] > 0, 1, 0) 59.445
     lq     mean median     uq     max neval
 26.515 29.92362 28.654 30.364  57.306   100
 62.438 68.84436 63.721 73.129 159.515   100

So depending on how large your dataframe is. You way want to consider the second method.

levels(my_df$grade) <- c(1,2,3) to recode grade mentioned by @thelatemail seems to be the most efficient.

Upvotes: 0

MFR
MFR

Reputation: 2077

Not sure if this one is the most efficient one, but we can use recode in car package for the character column.

my_df$metals <- ifelse (my_df$metals > 0, 1 , 0)

my_df$plastics <- ifelse (my_df$plastics > 0, 1 , 0)

library(car)
my_df$grade<-recode(my_df$grade, "'AA'=1; 'AB'='2'; 'AC'='3'")

Output

names metals plastics grade
1    s1      1        1     1
2    s2      1        0     2
3    s3      1        1     2
4    s4      1        1     2
5    s5      1        1     3
6    s6      1        1     2
7    s7      0        1  <NA>
8    s8      1        0  <NA>
9    s9      1        1  <NA>

Upvotes: 1

Related Questions