Reputation: 49
I have a dataset of microRNAs in 8 different groups, and I need to transform this data frame into a binary matrix using R. The number of microRNAs differs between groups. I would like the groups as rows and the microRNAs as columns. Here is part of the data:
Group1    Group2    Group3    Group4
miR-133a  miR-133b  miR-456   miR777
miR-777   miR138    miR-564   miR-878
miR-878             miR-777   miR978
                    miR-878
                    miR-978
Output expected:
Groups miR-133a miR-133b miR-456 miR-777.....
Group1 1 0 0 1
Group2 0 1 0 0
.
.
.
I tried to use this code:
im <- which(arr.ind=T,Dat!='');
u <- unique(Dat[im[order(im[,'row'],im[,'col']),]]);
res <- matrix(0L,nrow(Dat),length(u),dimnames=list(NULL,u));
res[cbind(im[,'row'],match(Dat[im],u))] <- 1L;
res
But it is giving me a lot of rows. Can anyone help me with that?
Upvotes: 2
Views: 74
Reputation: 46888
Assuming the blanks in your data frame are "":
df = structure(list(Group1 = c("miR-133a", "miR-777", "miR-878", "",
""), Group2 = c("miR-133b", "miR138", "", "", ""), Group3 = c("miR-456",
"miR-564", "miR-777", "miR-878", "miR-978"), Group4 = c("miR777",
"miR-878", "miR978", "", "")), row.names = c(NA, -5L), class = "data.frame")
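Then make a master set of all items and, for each group, mark which items it contains:
# all distinct microRNAs, excluding the blank entries
alla = setdiff(sort(unique(unlist(df))), "")
# one row per group: 1 if the microRNA occurs in that column, else 0
res = t(sapply(colnames(df), function(i) as.numeric(alla %in% df[,i])))
colnames(res) = alla
res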
       miR-133a miR-133b miR-456 miR-564 miR-777 miR-878 miR-978 miR138 miR777 miR978
Group1        1        0       0       0       1       1       0      0      0      0
Group2        0        1       0       0       0       0       0      1      0      0
Group3        0        0       1       1       1       1       1      0      0      0
Group4        0        0       0       0       0       1       0      0      1      1
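As a minimal sketch, the same idea can be wrapped in a small function that treats both "" and NA as blanks. The name `to_binary` is hypothetical (it is not part of the answer above), and the data frame is rebuilt here only so the snippet runs on its own.

```r
# Hypothetical helper: one row per column (group) of d, one column per distinct item.
# Both "" and NA are treated as blanks (an assumption about the input).
to_binary <- function(d) {
  vals  <- unlist(d, use.names = FALSE)
  items <- sort(unique(vals[!is.na(vals) & vals != ""]))
  m <- matrix(0L, nrow = ncol(d), ncol = length(items),
              dimnames = list(names(d), items))
  for (g in names(d)) {
    # 1 where the item occurs in this group's column, 0 otherwise
    m[g, ] <- as.integer(items %in% d[[g]])
  }
  m
}

df <- data.frame(
  Group1 = c("miR-133a", "miR-777", "miR-878", "", ""),
  Group2 = c("miR-133b", "miR138", "", "", ""),
  Group3 = c("miR-456", "miR-564", "miR-777", "miR-878", "miR-978"),
  Group4 = c("miR777", "miR-878", "miR978", "", ""),
  stringsAsFactors = FALSE
)

res <- to_binary(df)
dim(res)      # rows = groups, columns = distinct microRNAs
rowSums(res)  # number of distinct microRNAs per group
```

Filling a pre-allocated matrix avoids the case where sapply simplifies to a plain vector when there is only a single distinct item.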
Upvotes: 1
Reputation: 886938
Here is one option with tidyverse. Reshape to 'long' format, then convert it back to 'wide' format with pivot_wider:
library(dplyr)
library(tidyr)
df1 %>%
   pivot_longer(cols = everything(), names_to = 'Groups',
        values_drop_na = TRUE) %>%
   distinct() %>%
   mutate(new = 1) %>%
   pivot_wider(names_from = value, values_from = new,
        values_fill = list(new = 0))
#Groups `miR-133a` `miR-133b` `miR-456` miR777 `miR-777` miR138 `miR-564` `miR-878` miR978 `miR-978`
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Group1 1 0 0 0 1 0 0 1 0 0
#2 Group2 0 1 0 0 0 1 0 0 0 0
#3 Group3 0 0 1 0 1 0 1 1 0 1
#4 Group4 0 0 0 1 0 0 0 1 1 0
Or in base R with table:
table(names(df1)[col(df1)], unlist(df1))
# miR-133a miR-133b miR-456 miR-564 miR-777 miR-878 miR-978 miR138 miR777 miR978
# Group1 1 0 0 0 1 1 0 0 0 0
# Group2 0 1 0 0 0 0 0 1 0 0
# Group3 0 0 1 1 1 1 1 0 0 0
# Group4 0 0 0 0 0 1 0 0 1 1
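One thing to keep in mind is that table returns counts, so a microRNA listed twice within the same group would show up as 2 rather than 1. A minimal sketch, assuming such duplicates can occur in your data, that clamps the counts to 0/1 (the small data frame below is made up for illustration):

```r
# Toy data with a deliberate duplicate in Group1 (illustrative only)
df1 <- data.frame(
  Group1 = c("miR-777", "miR-777", NA),
  Group2 = c("miR-133b", NA, NA),
  stringsAsFactors = FALSE
)

# Same call as above: group label for every cell, crossed with the cell value
counts <- table(names(df1)[col(df1)], unlist(df1))

# Turn counts into presence/absence: TRUE/FALSE, then unary + coerces to 1/0
binary <- +(counts > 0)
binary
```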
NOTE: Here, we assume the blanks are NA. If they are "", first change them to NA and then use the same code:
df1[df1 == ""] <- NA
Data (blanks stored as NA):
df1 <- structure(list(Group1 = c("miR-133a", "miR-777", "miR-878", NA,
NA), Group2 = c("miR-133b", "miR138", NA, NA, NA), Group3 = c("miR-456",
"miR-564", "miR-777", "miR-878", "miR-978"), Group4 = c("miR777",
"miR-878", "miR978", NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
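Putting the note and the table approach together, here is a minimal sketch that starts from a data frame whose blanks are empty strings. The name `df_blank` is made up for illustration.

```r
# Same example data, but with "" instead of NA for the blanks
df_blank <- data.frame(
  Group1 = c("miR-133a", "miR-777", "miR-878", "", ""),
  Group2 = c("miR-133b", "miR138", "", "", ""),
  Group3 = c("miR-456", "miR-564", "miR-777", "miR-878", "miR-978"),
  Group4 = c("miR777", "miR-878", "miR978", "", ""),
  stringsAsFactors = FALSE
)

# Convert empty strings to NA so that table() drops them by default
df_blank[df_blank == ""] <- NA

# Cross-tabulate group labels against cell values
out <- table(names(df_blank)[col(df_blank)], unlist(df_blank))
out
```

Without the conversion step, table would count "" as a value and add an extra column for it.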
Upvotes: 3