Reputation: 111
I have a data frame that has several hundred thousand rows and 6 columns. Each column contains IDs (there are about 500 unique IDs in total). I would like to convert this data frame into a large table/matrix with each unique ID having its own column, and having a -1, 0, or 1 for each entry corresponding to the following logic: 0 if the ID is not present, -1 if the ID is in the first 3 columns, and 1 if the ID is in the last 3 columns.
I can use the brute force approach, looping through each row, one-by-one, but I'm looking for a faster and more polished way to do this. My preference would be to use a dplyr solution, assuming one exists. I'm guessing there's also a nifty way to do this with data.table, or even just a simple base R approach. Any help would be appreciated!
Thanks in advance. Here's an example of what my data looks like:
df <- data.frame(matrix(c("XX001","XX002","XX003","XX007","XX008","XX009",
"XX001","XX004","XX005","XX006","XX010","XX008",
"XX003","XX002","XX005","XX008","XX009","XX010",
"XX002","XX005","XX003","XX009","XX007","XX010",
"XX001","XX002","XX004","XX007","XX009","XX006"),
nrow=5, ncol=6, byrow=1))
names(df) <- c("ID_X1","ID_X2","ID_X3","ID_Y1","ID_Y2","ID_Y3")
df
> df
ID_X1 ID_X2 ID_X3 ID_Y1 ID_Y2 ID_Y3
1 XX001 XX002 XX003 XX007 XX008 XX009
2 XX001 XX004 XX005 XX006 XX010 XX008
3 XX003 XX002 XX005 XX008 XX009 XX010
4 XX002 XX005 XX003 XX009 XX007 XX010
5 XX001 XX002 XX004 XX007 XX009 XX006
And here's what I would like my output to look like:
> yay
XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010 ... XX500
1 -1 -1 -1 0 0 0 1 1 1 0 ... 0
2 -1 0 0 -1 -1 1 0 1 0 1 ... 0
3 0 -1 -1 0 -1 0 0 1 1 1 ... 0
4 0 -1 -1 0 -1 0 1 0 1 1 ... 0
5 -1 -1 0 -1 0 1 1 0 1 0 ... 0
Upvotes: 2
Views: 67
Reputation: 39605
Try this:
library(dplyr)
library(tidyr)
#Code
newdf <- df %>% mutate(id=row_number()) %>%
pivot_longer(-id) %>%
group_by(id) %>%
mutate(Val=ifelse(row_number() %in% 1:3,-1,
ifelse(row_number() %in% ((n()-3):n()),1,0))) %>%
select(-name) %>%
pivot_wider(names_from = value,values_from=Val,names_sort = T,values_fill = 0) %>%
ungroup() %>%
select(-id)
Output:
# A tibble: 5 x 10
XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -1 -1 -1 0 0 0 1 1 1 0
2 -1 0 0 -1 -1 1 0 1 0 1
3 0 -1 -1 0 -1 0 0 1 1 1
4 0 -1 -1 0 -1 0 1 0 1 1
5 -1 -1 0 -1 0 1 1 0 1 0
Update: As OP is having issues with duplicated values, here a possible sketch to tackle the task. First a dummy data:
df2
ID_X1 ID_X2 ID_X3 ID_Y1 ID_Y2 ID_Y3
1 XX001 XX001 XX003 XX007 XX008 XX009
2 XX001 XX004 XX005 XX006 XX010 XX008
3 XX003 XX002 XX005 XX008 XX009 XX010
4 XX002 XX005 XX003 XX009 XX007 XX010
5 XX001 XX002 XX004 XX007 XX009 XX006
We can see the first row has a duplicated. So, we can create an index to make difference between duplicated values. Here the code:
#Code 2
newdf <- df2 %>% mutate(id=row_number()) %>%
pivot_longer(-id) %>%
group_by(id) %>%
mutate(Val=ifelse(row_number() %in% 1:3,-1,
ifelse(row_number() %in% ((n()-3):n()),1,0))) %>%
ungroup() %>%
group_by(id,value) %>%
mutate(value=paste0(value,'.',row_number())) %>%
select(-name) %>%
pivot_wider(names_from = value,values_from=Val,names_sort = T,values_fill = 0) %>%
ungroup() %>%
select(-id)
Output:
# A tibble: 5 x 11
XX001.1 XX001.2 XX002.1 XX003.1 XX004.1 XX005.1 XX006.1 XX007.1 XX008.1 XX009.1 XX010.1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -1 -1 0 -1 0 0 0 1 1 1 0
2 -1 0 0 0 -1 -1 1 0 1 0 1
3 0 0 -1 -1 0 -1 0 0 1 1 1
4 0 0 -1 -1 0 -1 0 1 0 1 1
5 -1 0 -1 0 -1 0 1 1 0 1 0
In this way, the duplicated value has been kept.
Upvotes: 2
Reputation: 1250
This is a vectorized solution:
id <- as.character(as.matrix(df)) %>% unique(.)
id <- id[order(id)]
match_id <- function(x) match(id,x)
yay <- as.data.frame(t(apply(df,1,match_id)))
names(yay) <- id
yay[yay<=3] <- -1
yay[yay>3] <- 1
yay[is.na(yay)] <- 0
Output:
yay
# XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010
# 1 -1 -1 -1 0 0 0 1 1 1 0
# 2 -1 0 0 -1 -1 1 0 1 0 1
# 3 0 -1 -1 0 -1 0 0 1 1 1
# 4 0 -1 -1 0 -1 0 1 0 1 1
# 5 -1 -1 0 -1 0 1 1 0 1 0
Upvotes: 1