donkeyshark
donkeyshark

Reputation: 111

Efficiently reshaping a non-standard dummy-coded matrix or table in R

I have a data frame that has several hundred thousand rows and 6 columns. Each column contains IDs (there are about 500 unique IDs in total). I would like to convert this data frame into a large table/matrix with each unique ID having its own column, and having a -1, 0, or 1 for each entry corresponding to the following logic: 0 if the ID is not present, -1 if the ID is in the first 3 columns, and 1 if the ID is in the last 3 columns.

I can use the brute force approach, looping through each row, one-by-one, but I'm looking for a faster and more polished way to do this. My preference would be to use a dplyr solution, assuming one exists. I'm guessing there's also a nifty way to do this with data.table, or even just a simple base R approach. Any help would be appreciated!

Thanks in advance. Here's an example of what my data looks like:

df <- data.frame(matrix(c("XX001","XX002","XX003","XX007","XX008","XX009",
                          "XX001","XX004","XX005","XX006","XX010","XX008",
                          "XX003","XX002","XX005","XX008","XX009","XX010",
                          "XX002","XX005","XX003","XX009","XX007","XX010",
                          "XX001","XX002","XX004","XX007","XX009","XX006"),
                        nrow=5, ncol=6, byrow=1))
names(df) <- c("ID_X1","ID_X2","ID_X3","ID_Y1","ID_Y2","ID_Y3")
df

> df
  ID_X1 ID_X2 ID_X3 ID_Y1 ID_Y2 ID_Y3
1 XX001 XX002 XX003 XX007 XX008 XX009
2 XX001 XX004 XX005 XX006 XX010 XX008
3 XX003 XX002 XX005 XX008 XX009 XX010
4 XX002 XX005 XX003 XX009 XX007 XX010
5 XX001 XX002 XX004 XX007 XX009 XX006

And here's what I would like my output to look like:

> yay
  XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010  ... XX500
1    -1    -1    -1     0     0     0     1     1     1     0  ...     0
2    -1     0     0    -1    -1     1     0     1     0     1  ...     0
3     0    -1    -1     0    -1     0     0     1     1     1  ...     0
4     0    -1    -1     0    -1     0     1     0     1     1  ...     0
5    -1    -1     0    -1     0     1     1     0     1     0  ...     0

Upvotes: 2

Views: 67

Answers (2)

Duck
Duck

Reputation: 39605

Try this:

library(dplyr)
library(tidyr)
#Code
newdf <- df %>% mutate(id=row_number()) %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  mutate(Val=ifelse(row_number() %in% 1:3,-1,
                    ifelse(row_number() %in% ((n()-3):n()),1,0))) %>%
  select(-name) %>%
  pivot_wider(names_from = value,values_from=Val,names_sort = T,values_fill = 0) %>%
  ungroup() %>%
  select(-id)

Output:

# A tibble: 5 x 10
  XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    -1    -1    -1     0     0     0     1     1     1     0
2    -1     0     0    -1    -1     1     0     1     0     1
3     0    -1    -1     0    -1     0     0     1     1     1
4     0    -1    -1     0    -1     0     1     0     1     1
5    -1    -1     0    -1     0     1     1     0     1     0

Update: As OP is having issues with duplicated values, here a possible sketch to tackle the task. First a dummy data:

df2
  ID_X1 ID_X2 ID_X3 ID_Y1 ID_Y2 ID_Y3
1 XX001 XX001 XX003 XX007 XX008 XX009
2 XX001 XX004 XX005 XX006 XX010 XX008
3 XX003 XX002 XX005 XX008 XX009 XX010
4 XX002 XX005 XX003 XX009 XX007 XX010
5 XX001 XX002 XX004 XX007 XX009 XX006

We can see the first row has a duplicated. So, we can create an index to make difference between duplicated values. Here the code:

#Code 2
newdf <- df2 %>% mutate(id=row_number()) %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  mutate(Val=ifelse(row_number() %in% 1:3,-1,
                    ifelse(row_number() %in% ((n()-3):n()),1,0))) %>%
  ungroup() %>%
  group_by(id,value) %>%
  mutate(value=paste0(value,'.',row_number())) %>%
  select(-name) %>%
  pivot_wider(names_from = value,values_from=Val,names_sort = T,values_fill = 0) %>%
  ungroup() %>%
  select(-id)

Output:

# A tibble: 5 x 11
  XX001.1 XX001.2 XX002.1 XX003.1 XX004.1 XX005.1 XX006.1 XX007.1 XX008.1 XX009.1 XX010.1
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1      -1      -1       0      -1       0       0       0       1       1       1       0
2      -1       0       0       0      -1      -1       1       0       1       0       1
3       0       0      -1      -1       0      -1       0       0       1       1       1
4       0       0      -1      -1       0      -1       0       1       0       1       1
5      -1       0      -1       0      -1       0       1       1       0       1       0

In this way, the duplicated value has been kept.

Upvotes: 2

Marcos P&#233;rez
Marcos P&#233;rez

Reputation: 1250

This is a vectorized solution:

id <- as.character(as.matrix(df)) %>% unique(.)
id <- id[order(id)]

match_id <- function(x) match(id,x)

yay <- as.data.frame(t(apply(df,1,match_id)))
names(yay) <- id
yay[yay<=3] <- -1
yay[yay>3] <- 1
yay[is.na(yay)] <- 0

Output:

yay
#   XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010
# 1    -1    -1    -1     0     0     0     1     1     1     0
# 2    -1     0     0    -1    -1     1     0     1     0     1
# 3     0    -1    -1     0    -1     0     0     1     1     1
# 4     0    -1    -1     0    -1     0     1     0     1     1
# 5    -1    -1     0    -1     0     1     1     0     1     0

Upvotes: 1

Related Questions