jeza

Reputation: 299

Preparing categorical and mixed data for KNN in R

I have written my own function to build a knn model.

It works well with numerical data.

How should I prepare categorical and mixed data for KNN in R?

Here are the two types of data I have encountered.

1- Mixed data

Some rows and columns of the data

      V1                V2      V3            V4 V5                     V6
1     39         State-gov   77516     Bachelors 13          Never-married
2     50  Self-emp-not-inc   83311     Bachelors 13     Married-civ-spouse
3     38           Private  215646       HS-grad  9               Divorced
4     53           Private  234721          11th  7     Married-civ-spouse
5     28           Private  338409     Bachelors 13     Married-civ-spouse
6     37           Private  284582       Masters 14     Married-civ-spouse
7     49           Private  160187           9th  5  Married-spouse-absent
8     52  Self-emp-not-inc  209642       HS-grad  9     Married-civ-spouse
9     31           Private   45781       Masters 14          Never-married
10    42           Private  159449     Bachelors 13     Married-civ-spouse
11    37           Private  280464  Some-college 10     Married-civ-spouse
12    30         State-gov  141297     Bachelors 13     Married-civ-spouse

2- Categorical data

Some rows and columns of the data

     V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1     p  x  s  n  t  p  f  c  n   k   e   e   s   s   w   w   p   w   o   p   k   s   u
2     e  x  s  y  t  a  f  c  b   k   e   c   s   s   w   w   p   w   o   p   n   n   g
3     e  b  s  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p   n   n   m
4     p  x  y  w  t  p  f  c  n   n   e   e   s   s   w   w   p   w   o   p   k   s   u
5     e  x  s  g  f  n  f  w  b   k   t   e   s   s   w   w   p   w   o   e   n   a   g
6     e  x  y  y  t  a  f  c  b   n   e   c   s   s   w   w   p   w   o   p   k   n   g
7     e  b  s  w  t  a  f  c  b   g   e   c   s   s   w   w   p   w   o   p   k   n   m
8     e  b  y  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p   n   s   m
9     p  x  y  w  t  p  f  c  n   p   e   e   s   s   w   w   p   w   o   p   k   v   g
10    e  b  s  y  t  a  f  c  b   g   e   c   s   s   w   w   p   w   o   p   k   s   m
11    e  x  y  y  t  l  f  c  b   g   e   c   s   s   w   w   p   w   o   p   n   n   g

Upvotes: 2

Views: 1139

Answers (1)

RobJan

Reputation: 1441

Here is an example with one column (df is your mixed data):

library(CatEncoders)

test <- df$V4 # select one column

lenc <- LabelEncoder.fit(test)

print(lenc)
# An object of class "LabelEncoder.Factor"
# Slot "classes":
# [1] 11th         9th          Bachelors    HS-grad      Masters
# [6] Some-college
# Levels: 11th 9th Bachelors HS-grad Masters Some-college
#
# Slot "type":
# [1] "factor"
#
# Slot "mapping":
#        classes ind
# 1         11th   1
# 2          9th   2
# 3    Bachelors   3
# 4      HS-grad   4
# 5      Masters   5
# 6 Some-college   6

transformed_test <- transform(lenc, test)
print(transformed_test)
# [1] 3 3 4 1 3 5 2 4 5 3 6 3
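
For comparison, the same kind of mapping can be sketched in base R with factor(). This is only an illustration of the idea, not how CatEncoders is implemented; v4, enc_levels and encode are names made up for this example, and the values are the V4 column from the question.

```r
# Values of column V4 from the mixed data shown in the question
v4 <- c("Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "Masters",
        "9th", "HS-grad", "Masters", "Bachelors", "Some-college", "Bachelors")

# "Fit": remember the sorted distinct categories seen in the training data
# (method = "radix" sorts in the C locale, so the order does not depend on locale)
enc_levels <- sort(unique(v4), method = "radix")

# "Transform": position of each value within the stored levels
encode <- function(x, levels) as.integer(factor(x, levels = levels))

codes <- encode(v4, enc_levels)
print(codes)
# [1] 3 3 4 1 3 5 2 4 5 3 6 3

# Reuse the same levels for new rows so the codes stay consistent;
# a category that was never seen when fitting becomes NA
print(encode(c("Masters", "Doctorate"), enc_levels))
# [1]  5 NA
```

Whichever encoder you use, fit the mapping once on the training data and reuse it for the rows you want to classify; otherwise the same category can end up with different codes in the two sets.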

Update

To transform all columns of the data frame, apply the encoder column by column:

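# Label-encode one column; numeric columns are returned unchanged.
# (Named encode_column so that it does not mask base::t.)
encode_column <- function(x) {
    if (is.numeric(x)) {
        return(x)
    }
    l <- LabelEncoder.fit(x)
    transform(l, x)
}

# lapply keeps one vector per column; sapply would collapse the result to a matrix
new_df <- as.data.frame(lapply(df, encode_column))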
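
One caveat for KNN: integer codes give nominal categories an order and a spacing (for example, code 5 ends up "further" from code 1 than code 2 is), and a Euclidean distance will treat that as real. A common alternative, not part of the answer above, is to one-hot encode the categorical columns and rescale the numeric ones. Below is a minimal base-R sketch under those assumptions; the toy data frame, range01 and one_hot are hypothetical names used only for illustration.

```r
# Hypothetical toy data mirroring three columns of the mixed data
df <- data.frame(
  V1 = c(39, 50, 38, 53),
  V2 = c("State-gov", "Self-emp-not-inc", "Private", "Private"),
  V5 = c(13, 13, 9, 7)
)

is_num <- vapply(df, is.numeric, logical(1))

# Min-max scaling so that no numeric column dominates the distance
# (assumes every numeric column has at least two distinct values)
range01 <- function(x) (x - min(x)) / (max(x) - min(x))
num_part <- as.matrix(as.data.frame(lapply(df[is_num], range01)))

# One 0/1 indicator column per category level
one_hot <- function(x, name) {
  lv <- sort(unique(as.character(x)), method = "radix")
  m <- outer(as.character(x), lv, "==") + 0L
  colnames(m) <- paste(name, lv, sep = "_")
  m
}
cat_part <- do.call(cbind, unname(Map(one_hot, df[!is_num], names(df)[!is_num])))

# Numeric matrix with every column in [0, 1]
X <- cbind(num_part, cat_part)
print(dim(X))
# [1] 4 5

# Pairwise Euclidean distances that a hand-written KNN could work from
d <- as.matrix(dist(X))
```

For purely categorical data such as the second data set, the squared Euclidean distance between two one-hot encoded rows is twice the number of columns in which they differ, so the neighbours are ranked by how many attributes mismatch.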

Upvotes: 1
