Reputation: 2584
I have a data like below
df<-structure(list(V1 = structure(c(5L, 1L, 7L, 3L, 2L, 4L, 6L, 6L
), .Label = c("CPSIAAAIAAVNALHGR", "DLNYCFSGMSDHR", "FPEHELIVDPQR",
"IADPDAVKPDDWDEDAPSK", "LWADHGVQACFGR", "WGEAGAEYVVESTGVFTTMEK",
"YYVTIIDAPGHR"), class = "factor"), V2 = structure(c(5L, 2L,
7L, 3L, 4L, 6L, 1L, 1L), .Label = c("", "CPSIAAAIAAVNALHGR",
"GCITIIGGGDTATCCAK", "HVGPGVLSMANAGPNTNGSQFFICTIK", "LLELGPKPEVAQQTR",
"MVCCSAWSEDHPICNLFTCGFDR", "YYVTIIDAPGHR"), class = "factor"),
V3 = structure(c(4L, 3L, 2L, 4L, 3L, 1L, 1L, 1L), .Label = c("",
"AVCMLSNTTAIAEAWAR", "DLNYCFSGMSDHR", "FPEHELIVDPQR"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -8L))
I want to know which and how many of the strings are shared in each column
for example
CPSIAAAIAAVNALHGR 1,2
YYVTIIDAPGHR 1,2
WGEAGAEYVVESTGVFTTMEK 1,1
FPEHELIVDPQR 1,3
DLNYCFSGMSDHR 1,3
DLNYCFSGMSDHR 1,3
It means the first string CPSIAAAIAAVNALHGR
is repeated in first column and second column. YYVTIIDAPGHR
is repetaed in first column and second column . etc etc
And then give me a percentage first column has 8 rows, among these 8 rows, it shares 2 rows with column 2 so it has 2/8*100 = 25% share first with third share 3/8*100 = 37% column 2 with column 3 shares 0% etc etc
Upvotes: 1
Views: 79
Reputation: 16121
I really feel that number of common stings after removing duplicates for each row (i.e considering unique strings that match) is more useful. So, I'm updating my solution and I'm adding some more code to get the Venn diagram.
library(dplyr)
library(tidyr)
library(gplots)
# reshape dataset
df_reshaped = df %>%
gather(column, string) %>%
filter(string != '') %>%
distinct()
# dataset that shows all strings and in which columns they appear
df_result1 = df_reshaped %>%
group_by(string) %>%
summarise(columns = paste(unique(column), collapse=","))
df_result1
# # A tibble: 12 x 2
# string columns
# <chr> <chr>
# 1 AVCMLSNTTAIAEAWAR V3
# 2 CPSIAAAIAAVNALHGR V1,V2
# 3 DLNYCFSGMSDHR V1,V3
# 4 FPEHELIVDPQR V1,V3
# 5 GCITIIGGGDTATCCAK V2
# 6 HVGPGVLSMANAGPNTNGSQFFICTIK V2
# 7 IADPDAVKPDDWDEDAPSK V1
# 8 LLELGPKPEVAQQTR V2
# 9 LWADHGVQACFGR V1
# 10 MVCCSAWSEDHPICNLFTCGFDR V2
# 11 WGEAGAEYVVESTGVFTTMEK V1
# 12 YYVTIIDAPGHR V1,V2
# function to get number of common rows
f1 = function(v1, v2) {
x1 = (df_reshaped %>% filter(column == v1))$string
x2 = (df_reshaped %>% filter(column == v2))$string
length(x2[x2 %in% x1]) }
f1 = Vectorize(f1)
# function to get number of rows of each column
f2 = function(v) {df_reshaped %>% filter(column == v) %>% nrow}
f2 = Vectorize(f2)
# dataset that shows overlap of columns (number of common strings)
expand.grid(unique(df_reshaped$column), unique(df_reshaped$column)) %>%
filter(Var1 != Var2) %>%
mutate(NumShared = f1(Var1, Var2),
NumRows = f2(Var1),
Prc = NumShared/NumRows) %>%
arrange(Var1, Var2)
# Var1 Var2 NumShared NumRows Prc
# 1 V1 V2 2 7 0.2857143
# 2 V1 V3 2 7 0.2857143
# 3 V2 V1 2 6 0.3333333
# 4 V2 V3 0 6 0.0000000
# 5 V3 V1 2 3 0.6666667
# 6 V3 V2 0 3 0.0000000
# reshape dataset and create a Venn diagram
df_reshaped %>%
mutate(exist = TRUE) %>%
spread(column, exist, fill=FALSE) %>%
select(-string) %>%
venn()
Obviously, the sum of the numbers shown in this diagram should be equal to the number of unique strings you get in the table df_result1
. Which is 12 in our case.
Upvotes: 2