Reputation: 457
I am performing a rather complicated linear regression that involves conditionally creating dummy variables in new columns with a for loop. So far I've been doing this in a couple of data frames, converting them to matrices, then converting those to sparse matrices, and then joining; however, I've reached my computer's limit. Sorry if this gets confusing - I've tried to simplify the process as much as I can.
EDIT - added all numeric examples to original question.
Here is the source data with all numeric values:
df <- data.frame(matrix(nrow = 9, ncol = 4))
df$X1 <- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
df$X2 <- c(10001, 10001, 10001, 10003, 10003, 10003, 10002, 10002, 10002)
df$X3 <- c(10002, 10002, 10002, 10001, 10001, 10001, 10003, 10003, 10003)
df$X4 <- c(10001, 10001, 10001, 10003, 10003, 10003, 10002, 10002, 10002)
names(df) <- c("response", "group_1", "group_2", "exclude")
What that looks like:
response group_1 group_2 exclude
1 5 10001 10002 10001
2 1 10001 10002 10001
3 2 10001 10002 10001
4 0 10003 10001 10003
5 4 10003 10001 10003
6 8 10003 10001 10003
7 7 10002 10003 10002
8 6 10002 10003 10002
9 0 10002 10003 10002
Source data (please see above edit):
df <- data.frame(matrix(nrow = 9, ncol = 4))
df$X1 <- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
df$X2 <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
df$X3 <- c("green", "green", "green", "blue", "blue", "blue", "yellow", "yellow", "yellow")
df$X4 <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
names(df) <- c("response", "group_1", "group_2", "exclude")
This is a simplified version of what the data looks like:
response group_1 group_2 exclude
1 5 blue green blue
2 1 blue green blue
3 2 blue green blue
4 0 yellow blue yellow
5 4 yellow blue yellow
6 8 yellow blue yellow
7 7 green yellow green
8 6 green yellow green
9 0 green yellow green
From the above data, I find the unique values in "group_1" and "group_2" using the following function:
fun_names <- function(x) {
  row1 <- unique(x$group_1)
  row2 <- unique(x$group_2)
  mat <- data.frame(matrix(nrow = length(row1) + length(row2), ncol = 1))
  mat[1] <- c(row1, row2)
  mat_unique <- data.frame(mat[!duplicated(mat[,1]), ])
  names(mat_unique) <- c("ID")
  return(mat_unique)
}
df_unique <- fun_names(df)
This returns the following data frame:
ID
1 blue
2 yellow
3 green
Then for each color ("ID") I create a new column that is 1 in every row where the color appears in "group_1" or "group_2" and does not match that row's "exclude" value, and 0 otherwise. The loop looks like this:
for (name in df_unique$ID) {
  df[paste(name)] <-
    ifelse(df$group_1 == name & df$exclude != name |
           df$group_2 == name & df$exclude != name, 1, 0)
}
Running this loop returns the final data.frame, which looks like this:
EDIT Here is the numeric data final df:
response group_1 group_2 exclude 10001 10003 10002
1 5 10001 10002 10001 0 0 1
2 1 10001 10002 10001 0 0 1
3 2 10001 10002 10001 0 0 1
4 0 10003 10001 10003 1 0 0
5 4 10003 10001 10003 1 0 0
6 8 10003 10001 10003 1 0 0
7 7 10002 10003 10002 0 1 0
8 6 10002 10003 10002 0 1 0
9 0 10002 10003 10002 0 1 0
Here is the original data:
response group_1 group_2 exclude blue yellow green
1 5 blue green blue 0 0 1
2 1 blue green blue 0 0 1
3 2 blue green blue 0 0 1
4 0 yellow blue yellow 1 0 0
5 4 yellow blue yellow 1 0 0
6 8 yellow blue yellow 1 0 0
7 7 green yellow green 0 1 0
8 6 green yellow green 0 1 0
9 0 green yellow green 0 1 0
So, my question: how do I perform this loop if the original data is a matrix (instead of a data frame)? Since the loop modifies a data frame, I need to convert that data frame to a matrix in order to convert it to a sparse matrix. This data.frame to data.matrix conversion is too intensive for my machine.
I have converted everything in my code up until the above for loop to matrix notation, but I can't figure out how to add the new columns in this manner while modifying a matrix in R (instead of a data frame). Basically, I'm hoping someone could help me modify the for loop so it will work on a matrix. Does anyone have any suggestions?
EDIT
I forgot to mention that the source data needs to retain its grouping: group_by(response, group_1, group_2, exclude). Also, the df object needs to start as a matrix to remove the data.frame to data.matrix conversion.
EDIT2
I did not mention this, but all data is indexed and converted into a numeric value before I run the entire process. So the df object in the example would actually contain only numbers.
Upvotes: 0
Views: 680
Reputation: 132864
Use a sparse matrix for the dummy encoding:
m <- as.matrix(df)
groups <- unique(as.vector(m[, grep("group", colnames(m))]))
tmp <- lapply(groups, function(x, m)
which((m[, "group_1"] == x | m[, "group_2"] == x) & m[, "exclude"] != x),
m = m)
j = rep(seq_along(tmp), lengths(tmp))
i = unlist(tmp)
library(Matrix)
dummies <- sparseMatrix(i, j, dims = c(nrow(m), length(groups)))
colnames(dummies) <- groups
M <- Matrix(as.matrix(df))
cbind(M, dummies)
#9 x 7 Matrix of class "dgeMatrix"
# response group_1 group_2 exclude 10001 10003 10002
#[1,] 5 10001 10002 10001 0 0 1
#[2,] 1 10001 10002 10001 0 0 1
#[3,] 2 10001 10002 10001 0 0 1
#[4,] 0 10003 10001 10003 1 0 0
#[5,] 4 10003 10001 10003 1 0 0
#[6,] 8 10003 10001 10003 1 0 0
#[7,] 7 10002 10003 10002 0 1 0
#[8,] 6 10002 10003 10002 0 1 0
#[9,] 0 10002 10003 10002 0 1 0
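As a self-contained check of the idea above, here is a minimal sketch that starts from a numeric matrix (so no data frame is involved at any point) and compares the sparse result against a dense reference. The names hits and dense are only illustrative, and passing x = 1 is my assumption that numeric 0/1 entries are wanted rather than a pattern matrix:

```r
library(Matrix)

# Numeric source data from the question, built directly as a matrix.
m <- cbind(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c(10001, 10003, 10002), each = 3),
  group_2  = rep(c(10002, 10001, 10003), each = 3),
  exclude  = rep(c(10001, 10003, 10002), each = 3)
)

groups <- unique(c(m[, "group_1"], m[, "group_2"]))

# For each group value: the rows where it appears in group_1 or group_2
# and is not the excluded value of that row.
hits <- lapply(groups, function(g) {
  which((m[, "group_1"] == g | m[, "group_2"] == g) & m[, "exclude"] != g)
})

# Row indices come from the hits, column indices from the position of
# the group in `groups`; only the non-zero cells are ever stored.
dummies <- sparseMatrix(
  i = unlist(hits),
  j = rep(seq_along(hits), lengths(hits)),
  x = 1,
  dims = c(nrow(m), length(groups)),
  dimnames = list(NULL, as.character(groups))
)

# Dense reference using the same condition, for comparison only.
dense <- sapply(groups, function(g) {
  as.numeric((m[, "group_1"] == g | m[, "group_2"] == g) & m[, "exclude"] != g)
})
stopifnot(all(as.matrix(dummies) == dense))
```

The dense reference is there only to confirm the result on the toy data; on the real data you would skip it, since building it is exactly the memory cost the sparse version avoids.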
Upvotes: 1
Reputation: 3116
So I am starting with a matrix like this:
m <- matrix(nrow = 9, ncol = 4)
m[,1]<- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
m[,2] <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
m[,3] <- c("green", "green", "green", "blue", "blue", "blue", "yellow", "yellow", "yellow")
m[,4] <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
colnames(m) <- c("response", "group_1", "group_2", "exclude")
>m
# response group_1 group_2 exclude
#[1,] "5" "blue" "green" "blue"
#[2,] "1" "blue" "green" "blue"
#[3,] "2" "blue" "green" "blue"
#[4,] "0" "yellow" "blue" "yellow"
#[5,] "4" "yellow" "blue" "yellow"
#[6,] "8" "yellow" "blue" "yellow"
#[7,] "7" "green" "yellow" "green"
#[8,] "6" "green" "yellow" "green"
#[9,] "0" "green" "yellow" "green"
Using the dummy() function from the dummies package:
library(dummies)
one_hot_encoded_vars = dummy(x = "group_2", data = m)
>one_hot_encoded_vars
# group_2blue group_2green group_2yellow
#[1,] 0 1 0
#[2,] 0 1 0
#[3,] 0 1 0
#[4,] 1 0 0
#[5,] 1 0 0
#[6,] 1 0 0
#[7,] 0 0 1
#[8,] 0 0 1
#[9,] 0 0 1
To create a numeric matrix with all variables included:
finalmatrix = cbind(as.numeric(m[,'response']),dummy(x = 'group_1',data = m),
dummy(x = 'group_2',data = m),dummy(x = 'exclude',data=m))
>finalmatrix
# group_1blue group_1green group_1yellow group_2blue group_2green group_2yellow excludeblue excludegreen
#[1,] 5 1 0 0 0 1 0 1 0
#[2,] 1 1 0 0 0 1 0 1 0
#[3,] 2 1 0 0 0 1 0 1 0
#[4,] 0 0 0 1 1 0 0 0 0
#[5,] 4 0 0 1 1 0 0 0 0
#[6,] 8 0 0 1 1 0 0 0 0
#[7,] 7 0 1 0 0 0 1 0 1
#[8,] 6 0 1 0 0 0 1 0 1
#[9,] 0 0 1 0 0 0 1 0 1
# excludeyellow
#[1,] 0
#[2,] 0
#[3,] 0
#[4,] 1
#[5,] 1
#[6,] 1
#[7,] 0
#[8,] 0
#[9,] 0
If you want to retain the group info you can:
finalmatrix = cbind(m, finalmatrix)
But then finalmatrix will be a character matrix, because m is character and cbind() coerces everything to a common type.
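If the matrix has to stay numeric, one option is to replace the labels with integer codes before binding, which is roughly what the question describes doing up front. This is only a sketch using base R's match(); levels_all and codes are names I made up for illustration:

```r
# Character matrix as in the example above.
m <- cbind(
  response = c("5", "1", "2", "0", "4", "8", "7", "6", "0"),
  group_1  = rep(c("blue", "yellow", "green"), each = 3),
  group_2  = rep(c("green", "blue", "yellow"), each = 3),
  exclude  = rep(c("blue", "yellow", "green"), each = 3)
)

# One shared lookup table so the same label gets the same code
# in every column.
levels_all <- sort(unique(c(m[, "group_1"], m[, "group_2"], m[, "exclude"])))

# Numeric matrix: the response is parsed, the labels become positions
# in levels_all.
codes <- cbind(
  response = as.numeric(m[, "response"]),
  group_1  = match(m[, "group_1"], levels_all),
  group_2  = match(m[, "group_2"], levels_all),
  exclude  = match(m[, "exclude"], levels_all)
)

is.numeric(codes)
```

Keeping levels_all around lets you map a code back to its label later with levels_all[code].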
Upvotes: 1
Reputation: 5068
Is this too intense for your matrices? It uses dplyr and tidyr to do away with for-loops altogether:
library(dplyr)
library(tidyr)
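m = df %>%
  # row id keeps rows with equal response values (e.g. the two 0s) separate in spread()
  mutate(row = row_number(),
         group = ifelse(group_1 == exclude, group_2, group_1), ones = 1) %>%
  select(row, response, group, ones) %>%
  spread(key = group, value = ones, fill = 0) %>%
  select(-row) %>%
  as.matrix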
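The pipeline above assumes that exactly one of group_1 and group_2 equals exclude in every row, which holds for the sample data. If that is not guaranteed, a variant that checks both group columns might look like the sketch below. It assumes tidyr 1.0.0 or later, where pivot_longer() and pivot_wider() supersede gather() and spread(); the names row, source and wide are illustrative only:

```r
library(dplyr)
library(tidyr)

# Numeric sample data from the question.
df <- data.frame(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c(10001, 10003, 10002), each = 3),
  group_2  = rep(c(10002, 10001, 10003), each = 3),
  exclude  = rep(c(10001, 10003, 10002), each = 3)
)

wide <- df %>%
  mutate(row = row_number()) %>%                 # unique id per source row
  pivot_longer(c(group_1, group_2),
               names_to = "source", values_to = "group") %>%
  filter(group != exclude) %>%                   # drop the excluded value
  distinct(row, response, group) %>%             # one flag per row and group
  mutate(ones = 1) %>%
  pivot_wider(names_from = group, values_from = ones, values_fill = 0) %>%
  arrange(row)
```

Note that a row whose two groups both equal exclude would disappear from the result in this sketch, so it would need to be joined back in if such rows can occur.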
Upvotes: 1