Reputation: 93
I have a dataframe containing allele data for many individuals. There are several (>20) unique alleles for a gene listed in 2 columns. I would like to count the presence of 1 copy of an unique alleles as 1, 2 copies as 2 and the absence as 0. Also each unique allele count for each individual should be in separate columns.
My current df is
ID allele_1 allele_2
01 A01 A02
02 A01 A07
03 A08 A03
04 A103 A01
05 A09 A08
06 A02 A02
07 A08 A08
I want my df to look like this;
ID A01 A02 A07 A08 A03 A103 A09
01 1 1 0 0 0 0 0
02 1 0 1 0 0 0 0
03 0 0 0 1 1 0 0
04 1 0 0 0 0 1 0
05 0 0 0 1 0 0 1
06 0 2 0 0 0 0 0
07 0 0 0 2 0 0 0
Upvotes: 1
Views: 49
Reputation: 887118
We can convert to 'long' format, get the 'count' and then reshape into 'wide'
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -ID) %>%
count(ID, value) %>%
pivot_wider(names_from = value, values_from = n, values_fill = list(n = 0))
# A tibble: 7 x 8
# ID A01 A02 A07 A03 A08 A103 A09
# <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 1 1 0 0 0 0 0
#2 2 1 0 1 0 0 0 0
#3 3 0 0 0 1 1 0 0
#4 4 1 0 0 0 0 1 0
#5 5 0 0 0 0 1 0 1
#6 6 0 2 0 0 0 0 0
#7 7 0 0 0 0 2 0 0
or with table
from base R
table(rep(df1$ID, 2), unlist(df1[-1]))
Or with recast
from reshape2
library(reshape2)
recast(df1, id.var = 'ID', ID ~ value, length)
df1 <- structure(list(ID = 1:7, allele_1 = c("A01", "A01", "A08", "A103",
"A09", "A02", "A08"), allele_2 = c("A02", "A07", "A03", "A01",
"A08", "A02", "A08")), class = "data.frame", row.names = c(NA,
-7L))
Upvotes: 3