Hester Lyons
Hester Lyons

Reputation: 823

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). the problem I have is as follows. My data looks like this (it is held in CSV):

RecNo   Var1    Var2    Var3
41     800     201.8    Y
43     140        39    N
47      60     20.24    N
49     687        77    Y
54     570       135    Y
58    1250       467    N
61     211        52    N
64      96     117.3    N
68     687        77    Y

Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.

I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:

normalize <- function(x) { 
   x <- sweep(x, 2, apply(x, 2, min)) 
   sweep(x, 2, apply(x, 2, max), "/") 
} 

However, when I apply it to my above data by calling myResult <- normalize(myData)

it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.

So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?

Upvotes: 0

Views: 2616

Answers (2)

Hester Lyons
Hester Lyons

Reputation: 823

I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine: 1. akrun's suggestion above:

myData2 <- myData1 %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))

This produces the following:

RecNo       Var1       Var2 Var3
1    41 0.62184874 0.40601834    Y
2    43 0.06722689 0.04195255    N
3    47 0.00000000 0.00000000    N
4    49 0.52689076 0.12693105    Y
5    54 0.42857143 0.25663508    Y
6    58 1.00000000 1.00000000    N
7    61 0.12689076 0.07102414    N
8    64 0.03025210 0.21718329    N
9    68 0.52689076 0.12693105    Y

Alternatively, there is the package BBmisc which allowed me the following after transforming my record numbers to factors:

> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData2, method="range", range = c(0,1), margin = 1)
> myNorm
  RecNo       Var1       Var2 Var3
1    41 0.62184874 0.40601834    Y
2    43 0.06722689 0.04195255    N
3    47 0.00000000 0.00000000    N
4    49 0.52689076 0.12693105    Y
5    54 0.42857143 0.25663508    Y
6    58 1.00000000 1.00000000    N
7    61 0.12689076 0.07102414    N
8    64 0.03025210 0.21718329    N
9    68 0.52689076 0.12693105    Y

EDIT: For completion I include TUSHAr's solution as well, showing as always that there are many ways around a single problem:

normalize<-function(x){
    minval=apply(x[,c(2,3)],2,min)
    maxval=apply(x[,c(2,3)],2,max)
    #print(minval)
    #print(maxval)
    y=sweep(x[,c(2,3)],2,minval)
    #print(y)
    sweep(y,2,(maxval-minval),"/")
}

df[,c(2,3)]=normalize(df)

Thank you for your help!

Upvotes: 2

tushaR
tushaR

Reputation: 3116

normalize<-function(x){
    minval=apply(x[,c(2,3)],2,min)
    maxval=apply(x[,c(2,3)],2,max)
    #print(minval)
    #print(maxval)
    y=sweep(x[,c(2,3)],2,minval)
    #print(y)
    sweep(y,2,(maxval-minval),"/")
}

df[,c(2,3)]=normalize(df)

Upvotes: 1

Related Questions