I have written my own function to build a KNN model, and it works well with numerical data.
My question is: how do I prepare categorical and mixed data for KNN in R?
Below are the two types of data I have encountered.
1- Mixed data
Some rows and columns of the data
V1 V2 V3 V4 V5 V6
1 39 State-gov 77516 Bachelors 13 Never-married
2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
3 38 Private 215646 HS-grad 9 Divorced
4 53 Private 234721 11th 7 Married-civ-spouse
5 28 Private 338409 Bachelors 13 Married-civ-spouse
6 37 Private 284582 Masters 14 Married-civ-spouse
7 49 Private 160187 9th 5 Married-spouse-absent
8 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse
9 31 Private 45781 Masters 14 Never-married
10 42 Private 159449 Bachelors 13 Married-civ-spouse
11 37 Private 280464 Some-college 10 Married-civ-spouse
12 30 State-gov 141297 Bachelors 13 Married-civ-spouse
2- Categorical data
Some rows and columns of the data
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1 p x s n t p f c n k e e s s w w p w o p k s u
2 e x s y t a f c b k e c s s w w p w o p n n g
3 e b s w t l f c b n e c s s w w p w o p n n m
4 p x y w t p f c n n e e s s w w p w o p k s u
5 e x s g f n f w b k t e s s w w p w o e n a g
6 e x y y t a f c b n e c s s w w p w o p k n g
7 e b s w t a f c b g e c s s w w p w o p k n m
8 e b y w t l f c b n e c s s w w p w o p n s m
9 p x y w t p f c n p e e s s w w p w o p k v g
10 e b s y t a f c b g e c s s w w p w o p k s m
11 e x y y t l f c b g e c s s w w p w o p n n g
Upvotes: 2
Here is an example with one column (df is your mixed data):
library(CatEncoders)
test <- df$V4 # select one column
lenc <- LabelEncoder.fit(test)
print(lenc)
# An object of class "LabelEncoder.Factor"
# Slot "classes":
# [1] 11th 9th Bachelors HS-grad Masters
# [6] Some-college
# Levels: 11th 9th Bachelors HS-grad Masters Some-college
#
# Slot "type":
# [1] "factor"
#
# Slot "mapping":
# classes ind
# 1 11th 1
# 2 9th 2
# 3 Bachelors 3
# 4 HS-grad 4
# 5 Masters 5
# 6 Some-college 6
transformed_test <- transform(lenc, test)
print(transformed_test)
# [1] 3 3 4 1 3 5 2 4 5 3 6 3
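If you would rather not depend on a package, the same kind of integer labels can be sketched in base R with factor() and as.integer(). This is only a rough equivalent, not what CatEncoders does internally. The vector v4 below is typed in by hand from the V4 column of the sample rows in the question, and the names v4, v4_factor, v4_codes and mapping are made up for this illustration.

```r
# V4 values copied by hand from the 12 sample rows in the question
v4 <- c("Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "Masters",
        "9th", "HS-grad", "Masters", "Bachelors", "Some-college", "Bachelors")

# factor() sorts the distinct values into levels;
# as.integer() then returns the index of each value's level
v4_factor <- factor(v4)
v4_codes <- as.integer(v4_factor)

# Lookup table from label to code, similar in spirit to the "mapping" slot
mapping <- data.frame(classes = levels(v4_factor),
                      ind = seq_along(levels(v4_factor)))

print(mapping)
print(v4_codes)
```

Keep the mapping table around if you need to encode new rows later, so that the same label always gets the same code.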
Update
Use the sapply function to transform all columns in the data frame:
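# Named encode_column rather than t so it does not mask base::t (transpose)
encode_column <- function(x) {
  # numeric columns are returned unchanged
  if (is.numeric(x)) {
    return(x)
  }
  # every other column is replaced by its integer labels
  enc <- LabelEncoder.fit(x)
  transform(enc, x)
}
# note: sapply returns a numeric matrix here, not a data frame
new_df <- sapply(df, encode_column)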
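One caveat for KNN specifically: integer labels put the categories on a numeric scale, so a distance-based method will treat code 1 and code 6 as farther apart than code 1 and code 2, even when the categories have no natural order. A common alternative (a different technique from the label encoding above) is one-hot encoding for the categorical columns plus scaling for the numeric ones. Below is a minimal base R sketch of that idea. The helpers one_hot and prepare_for_knn and the data frame df_small are hypothetical names for this illustration, and df_small only contains a few values copied from the question's sample rows.

```r
# One-hot encode a single categorical vector: one 0/1 column per level
one_hot <- function(x, prefix) {
  x <- factor(x)
  # outer() compares every value with every level, giving a logical matrix;
  # multiplying by 1L turns TRUE/FALSE into 1/0
  m <- outer(as.character(x), levels(x), "==") * 1L
  colnames(m) <- paste(prefix, levels(x), sep = "_")
  m
}

# Scale numeric columns and one-hot encode everything else
prepare_for_knn <- function(df) {
  parts <- lapply(names(df), function(nm) {
    col <- df[[nm]]
    if (is.numeric(col)) {
      m <- scale(col)  # centre to mean 0 and divide by the standard deviation
      colnames(m) <- nm
      m
    } else {
      one_hot(col, nm)
    }
  })
  do.call(cbind, parts)
}

# A few values copied from the mixed data in the question
df_small <- data.frame(
  V1 = c(39, 50, 38, 53),
  V2 = c("State-gov", "Self-emp-not-inc", "Private", "Private"),
  V4 = c("Bachelors", "Bachelors", "HS-grad", "11th"),
  stringsAsFactors = FALSE
)

X <- prepare_for_knn(df_small)
print(colnames(X))
print(dim(X))
```

The result is a plain numeric matrix, so it can be passed to a distance-based KNN function as is. Two things to keep in mind: a constant numeric column gives NaN after scale(), and new data has to be encoded with the same levels and the same centre and scale values as the training data.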
Upvotes: 1