fromtheloam

Reputation: 457

Performing a for loop on a matrix instead of a data frame

I am performing a rather complicated linear regression that involves conditionally creating dummy variables in new columns with a for loop. So far I've been doing this in a couple of data frames, converting them to matrices, then converting those to sparse matrices, and then joining; however, I've reached my computer's limit. Sorry if this gets confusing - I've tried to simplify the process as much as I can.

EDIT - added all numeric examples to original question.

Here is the source data with all numeric values:

df <- data.frame(matrix(nrow = 9, ncol = 4))
df$X1 <- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
df$X2 <- c(10001, 10001, 10001, 10003, 10003, 10003, 10002, 10002, 10002) 
df$X3 <- c(10002, 10002, 10002, 10001, 10001, 10001, 10003, 10003, 10003) 
df$X4 <- c(10001, 10001, 10001, 10003, 10003, 10003, 10002, 10002, 10002)
names(df) <- c("response", "group_1", "group_2", "exclude")

What that looks like:

  response group_1 group_2 exclude
1        5   10001   10002   10001
2        1   10001   10002   10001
3        2   10001   10002   10001
4        0   10003   10001   10003
5        4   10003   10001   10003
6        8   10003   10001   10003
7        7   10002   10003   10002
8        6   10002   10003   10002
9        0   10002   10003   10002

Source data (please see above edit):

df <- data.frame(matrix(nrow = 9, ncol = 4))
df$X1 <- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
df$X2 <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green") 
df$X3 <- c("green", "green", "green", "blue", "blue", "blue", "yellow", "yellow", "yellow") 
df$X4 <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
names(df) <- c("response", "group_1", "group_2", "exclude") 

This is a simplified version of what the data looks like:

  response group_1 group_2 exclude
1        5    blue   green    blue
2        1    blue   green    blue
3        2    blue   green    blue
4        0  yellow    blue  yellow
5        4  yellow    blue  yellow
6        8  yellow    blue  yellow
7        7   green  yellow   green
8        6   green  yellow   green
9        0   green  yellow   green

From the above data, I find the unique variables in "group_1" and "group_2" using the following function:

fun_names <- function(x) {
  row1 <- unique(x$group_1)
  row2 <- unique(x$group_2)
  mat <- data.frame(matrix(nrow = length(row1) + length(row2), ncol = 1))
  mat[1] <- c(row1, row2)
  mat_unique <- data.frame(mat[!duplicated(mat[,1]), ])
  names(mat_unique) <- c("ID")

  return(mat_unique)
}
df_unique <- fun_names(df)

This returns the following data frame:

      ID
1   blue
2 yellow
3  green
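For reference, the same set of IDs can probably be obtained without building an intermediate data frame. This is only a minimal sketch on the color version of the data; `ids` is an illustrative name that does not appear in the original code.

```r
# Rebuild the color version of the example data.
df <- data.frame(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c("blue", "yellow", "green"), each = 3),
  group_2  = rep(c("green", "blue", "yellow"), each = 3),
  exclude  = rep(c("blue", "yellow", "green"), each = 3),
  stringsAsFactors = FALSE
)

# Concatenate both group columns and keep the first occurrence of each value.
# `ids` is a hypothetical name used only for this illustration.
ids <- unique(c(df$group_1, df$group_2))
print(ids)
```

The order of `ids` follows first appearance, which matches the order of the `ID` column returned by `fun_names()` on this example.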

Then, for each color ("ID"), I create a new column that is 1 when the color appears in that row (in either "group_1" or "group_2") and does not match the row's "exclude" value, and 0 otherwise. The loop looks like this:

for(name in df_unique$ID) {
  df[paste(name)] <- 
    ifelse(df$group_1 == name & df$exclude != name | 
           df$group_2 == name & df$exclude != name, 1, 0)
}

Running this loop returns the final data.frame which looks like this:

EDIT Here is the numeric data final df:

  response group_1 group_2 exclude 10001 10003 10002
1        5   10001   10002   10001     0     0     1
2        1   10001   10002   10001     0     0     1
3        2   10001   10002   10001     0     0     1
4        0   10003   10001   10003     1     0     0
5        4   10003   10001   10003     1     0     0
6        8   10003   10001   10003     1     0     0
7        7   10002   10003   10002     0     1     0
8        6   10002   10003   10002     0     1     0
9        0   10002   10003   10002     0     1     0

Here is the final df for the original (color) data:

  response group_1 group_2 exclude blue yellow green
1        5    blue   green    blue    0      0     1
2        1    blue   green    blue    0      0     1
3        2    blue   green    blue    0      0     1
4        0  yellow    blue  yellow    1      0     0
5        4  yellow    blue  yellow    1      0     0
6        8  yellow    blue  yellow    1      0     0
7        7   green  yellow   green    0      1     0
8        6   green  yellow   green    0      1     0
9        0   green  yellow   green    0      1     0

So, my question: how do I perform this loop if the original data is a matrix (instead of a data frame)? Since the loop is modifying a data frame, I need to convert that data frame to a matrix in order to convert it to a sparse matrix - this data.frame to data.matrix conversion is too intensive for my machine.

I have converted everything in my code up to the above for loop to matrix notation, but I can't figure out how to add the new columns in this manner while modifying a matrix in R (instead of a data frame). Basically, I'm hoping someone could help me modify the for loop so it will work on a matrix. Does anyone have any suggestions?

EDIT I forgot to mention that the source data needs to retain its grouping - group_by(response, group_1, group_2, exclude). Also, the df object needs to start as a matrix to remove the data.frame to data.matrix conversion.

EDIT2 I did not mention this, but all data is indexed and converted into a numeric value before I run the entire process. So the df object in the example would actually be only numbers.

Upvotes: 0

Views: 680

Answers (3)

Roland

Reputation: 132864

Use a sparse matrix for the dummy encoding:

m <- as.matrix(df)

groups <- unique(as.vector(m[, grep("group", colnames(m))]))
tmp <- lapply(groups, function(x, m) 
  which((m[, "group_1"] == x | m[, "group_2"] == x) & m[, "exclude"] != x),
       m = m)

j = rep(seq_along(tmp), lengths(tmp))
i = unlist(tmp)

library(Matrix)
dummies <- sparseMatrix(i, j, dims = c(nrow(m), length(groups)))
colnames(dummies) <- groups

M <- Matrix(as.matrix(df))
cbind(M, dummies)
#9 x 7 Matrix of class "dgeMatrix"
#     response group_1 group_2 exclude 10001 10003 10002
#[1,]        5   10001   10002   10001     0     0     1
#[2,]        1   10001   10002   10001     0     0     1
#[3,]        2   10001   10002   10001     0     0     1
#[4,]        0   10003   10001   10003     1     0     0
#[5,]        4   10003   10001   10003     1     0     0
#[6,]        8   10003   10001   10003     1     0     0
#[7,]        7   10002   10003   10002     0     1     0
#[8,]        6   10002   10003   10002     0     1     0
#[9,]        0   10002   10003   10002     0     1     0
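The combined result above prints as a dense "dgeMatrix". If the aim is to keep everything sparse, one possible variation is to give the dummy entries explicit numeric values and convert the source matrix with `Matrix(m, sparse = TRUE)` before binding. This is a sketch under assumptions, not a tested solution for large data: `hits` and `combined` are illustrative names, and whether sparse storage actually saves memory depends on how many zeros the non-dummy columns contain.

```r
library(Matrix)

# Numeric version of the example data, built directly as a matrix.
m <- cbind(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c(10001, 10003, 10002), each = 3),
  group_2  = rep(c(10002, 10001, 10003), each = 3),
  exclude  = rep(c(10001, 10003, 10002), each = 3)
)

groups <- unique(as.vector(m[, c("group_1", "group_2")]))

# For each group ID, the rows where it appears and is not excluded.
# `hits` is a hypothetical name for this illustration.
hits <- lapply(groups, function(g) {
  which((m[, "group_1"] == g | m[, "group_2"] == g) & m[, "exclude"] != g)
})

i <- unlist(hits)
j <- rep(seq_along(hits), lengths(hits))

# Explicit x values give a numeric sparse matrix rather than a pattern matrix.
dummies <- sparseMatrix(
  i = i, j = j, x = rep(1, length(i)),
  dims = c(nrow(m), length(groups)),
  dimnames = list(NULL, as.character(groups))
)

# Bind two sparse matrices so the result is not densified.
combined <- cbind(Matrix(m, sparse = TRUE), dummies)
print(class(combined))
```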

Upvotes: 1

tushaR

Reputation: 3116

So I am starting with a matrix like this:

m <- matrix(nrow = 9, ncol = 4)
m[,1]<- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
m[,2] <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green") 
m[,3] <- c("green", "green", "green", "blue", "blue", "blue", "yellow", "yellow", "yellow") 
m[,4] <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
colnames(m) <- c("response", "group_1", "group_2", "exclude")

>m
 #    response group_1  group_2  exclude 
 #[1,] "5"      "blue"   "green"  "blue"  
 #[2,] "1"      "blue"   "green"  "blue"  
 #[3,] "2"      "blue"   "green"  "blue"  
 #[4,] "0"      "yellow" "blue"   "yellow"
 #[5,] "4"      "yellow" "blue"   "yellow"
 #[6,] "8"      "yellow" "blue"   "yellow"
 #[7,] "7"      "green"  "yellow" "green" 
 #[8,] "6"      "green"  "yellow" "green" 
 #[9,] "0"      "green"  "yellow" "green"

Using the package dummies' dummy() function:

library(dummies)
one_hot_encoded_vars = dummy(x = "group_2", data = m)
>one_hot_encoded_vars
 #        group_2blue group_2green group_2yellow
 #[1,]           0            1             0
 #[2,]           0            1             0
 #[3,]           0            1             0
 #[4,]           1            0             0
 #[5,]           1            0             0
 #[6,]           1            0             0
 #[7,]           0            0             1
 #[8,]           0            0             1
 #[9,]           0            0             1

To create a numeric matrix with all variables included:

finalmatrix = cbind(as.numeric(m[,'response']),dummy(x = 'group_1',data = m),
    dummy(x = 'group_2',data = m),dummy(x = 'exclude',data=m))

>finalmatrix
#             group_1blue group_1green group_1yellow group_2blue group_2green group_2yellow excludeblue excludegreen
 #[1,] 5           1            0             0           0            1             0           1            0
 #[2,] 1           1            0             0           0            1             0           1            0
 #[3,] 2           1            0             0           0            1             0           1            0
 #[4,] 0           0            0             1           1            0             0           0            0
 #[5,] 4           0            0             1           1            0             0           0            0
 #[6,] 8           0            0             1           1            0             0           0            0
 #[7,] 7           0            1             0           0            0             1           0            1
 #[8,] 6           0            1             0           0            0             1           0            1
 #[9,] 0           0            1             0           0            0             1           0            1
 #         excludeyellow
 #[1,]             0
 #[2,]             0
 #[3,]             0
 #[4,]             1
 #[5,]             1
 #[6,]             1
 #[7,]             0
 #[8,]             0
 #[9,]             0

If you want to retain the group info you can:

 finalmatrix = cbind(m, finalmatrix)

But then finalmatrix will be a character matrix, because a matrix can hold only one type and the numeric columns are coerced to strings.
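The coercion can be seen on a tiny example. This is only an illustration with made-up two-row matrices (`chr`, `num`, and `coded` are hypothetical names); it also shows that numeric group codes, as described in the question's EDIT2, would keep the result numeric.

```r
# A character column and a numeric dummy column, two rows each.
chr <- matrix(c("blue", "green"), ncol = 1, dimnames = list(NULL, "group_1"))
num <- matrix(c(1, 0), ncol = 1, dimnames = list(NULL, "group_1blue"))

# Binding them forces everything to character.
mixed <- cbind(chr, num)
print(typeof(mixed))   # "character"

# With numeric group codes the combined matrix stays numeric.
coded <- matrix(c(10001, 10002), ncol = 1, dimnames = list(NULL, "group_1"))
all_numeric <- cbind(coded, num)
print(typeof(all_numeric))   # "double"
```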

Upvotes: 1

lebelinoz

Reputation: 5068

Is this too intense for your matrices? It uses dplyr and tidyr to do away with for-loops altogether:

library(dplyr)
library(tidyr)

m = df %>% 
    mutate(group = ifelse(group_1 == exclude, group_2, group_1), ones = 1) %>%
    select(response, group, ones) %>%
    spread(key = group, value = ones, fill = 0) %>%
    as.matrix
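A similar loop-free idea can be sketched in base R directly on a numeric matrix, using matrix indexing instead of `spread`. It assumes, as in the example data, that exactly one of group_1 and group_2 differs from exclude in every row; if that does not hold, the result will not match the original loop. The names `other`, `ids`, and `dummy` are illustrative.

```r
# Numeric version of the example data as a plain matrix.
m <- cbind(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c(10001, 10003, 10002), each = 3),
  group_2  = rep(c(10002, 10001, 10003), each = 3),
  exclude  = rep(c(10001, 10003, 10002), each = 3)
)

# Assumption: the non-excluded group is whichever column differs from exclude.
other <- ifelse(m[, "group_1"] == m[, "exclude"], m[, "group_2"], m[, "group_1"])
ids <- unique(other)

# Preallocate the dummy block and set one cell per row via (row, column) pairs.
dummy <- matrix(0, nrow = nrow(m), ncol = length(ids),
                dimnames = list(NULL, as.character(ids)))
dummy[cbind(seq_len(nrow(m)), match(other, ids))] <- 1

result <- cbind(m, dummy)
print(result)
```

Unlike `spread`, this keeps one output row per input row, so rows that share a response value are not merged. The dummy columns appear in order of first occurrence in `other`, which may differ from the column order in the question.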

Upvotes: 1
