nik
nik

Reputation: 2584

how can I find percentage of similar string within and across various columns

I have a data like below

df<-structure(list(V1 = structure(c(5L, 1L, 7L, 3L, 2L, 4L, 6L, 6L
), .Label = c("CPSIAAAIAAVNALHGR", "DLNYCFSGMSDHR", "FPEHELIVDPQR", 
"IADPDAVKPDDWDEDAPSK", "LWADHGVQACFGR", "WGEAGAEYVVESTGVFTTMEK", 
"YYVTIIDAPGHR"), class = "factor"), V2 = structure(c(5L, 2L, 
7L, 3L, 4L, 6L, 1L, 1L), .Label = c("", "CPSIAAAIAAVNALHGR", 
"GCITIIGGGDTATCCAK", "HVGPGVLSMANAGPNTNGSQFFICTIK", "LLELGPKPEVAQQTR", 
"MVCCSAWSEDHPICNLFTCGFDR", "YYVTIIDAPGHR"), class = "factor"), 
    V3 = structure(c(4L, 3L, 2L, 4L, 3L, 1L, 1L, 1L), .Label = c("", 
    "AVCMLSNTTAIAEAWAR", "DLNYCFSGMSDHR", "FPEHELIVDPQR"), class = "factor")), .Names = c("V1", 
"V2", "V3"), class = "data.frame", row.names = c(NA, -8L))

I want to know which and how many of the strings are shared in each column

for example

CPSIAAAIAAVNALHGR          1,2       
YYVTIIDAPGHR               1,2
WGEAGAEYVVESTGVFTTMEK      1,1
FPEHELIVDPQR               1,3
DLNYCFSGMSDHR              1,3
DLNYCFSGMSDHR              1,3

It means the first string CPSIAAAIAAVNALHGR is repeated in first column and second column. YYVTIIDAPGHR is repetaed in first column and second column . etc etc

And then give me a percentage first column has 8 rows, among these 8 rows, it shares 2 rows with column 2 so it has 2/8*100 = 25% share first with third share 3/8*100 = 37% column 2 with column 3 shares 0% etc etc

Upvotes: 1

Views: 79

Answers (1)

AntoniosK
AntoniosK

Reputation: 16121

I really feel that number of common stings after removing duplicates for each row (i.e considering unique strings that match) is more useful. So, I'm updating my solution and I'm adding some more code to get the Venn diagram.

library(dplyr)
library(tidyr)
library(gplots)

# reshape dataset
df_reshaped = df %>% 
  gather(column, string) %>% 
  filter(string != '') %>%
  distinct()

# dataset that shows all strings and in which columns they appear
df_result1 = df_reshaped %>% 
  group_by(string) %>% 
  summarise(columns = paste(unique(column), collapse=","))

df_result1


# # A tibble: 12 x 2
#                        string columns
#                         <chr>   <chr>
# 1           AVCMLSNTTAIAEAWAR      V3
# 2           CPSIAAAIAAVNALHGR   V1,V2
# 3               DLNYCFSGMSDHR   V1,V3
# 4                FPEHELIVDPQR   V1,V3
# 5           GCITIIGGGDTATCCAK      V2
# 6 HVGPGVLSMANAGPNTNGSQFFICTIK      V2
# 7         IADPDAVKPDDWDEDAPSK      V1
# 8             LLELGPKPEVAQQTR      V2
# 9               LWADHGVQACFGR      V1
# 10     MVCCSAWSEDHPICNLFTCGFDR     V2
# 11       WGEAGAEYVVESTGVFTTMEK     V1
# 12                YYVTIIDAPGHR  V1,V2


# function to get number of common rows
f1 = function(v1, v2) {
  x1 = (df_reshaped %>% filter(column == v1))$string
  x2 = (df_reshaped %>% filter(column == v2))$string
  length(x2[x2 %in% x1]) }
f1 = Vectorize(f1)


# function to get number of rows of each column
f2 = function(v) {df_reshaped %>% filter(column == v) %>% nrow}
f2 = Vectorize(f2)

# dataset that shows overlap of columns (number of common strings)
expand.grid(unique(df_reshaped$column), unique(df_reshaped$column)) %>%
  filter(Var1 != Var2) %>%
  mutate(NumShared = f1(Var1, Var2),
         NumRows = f2(Var1),
         Prc = NumShared/NumRows) %>%
  arrange(Var1, Var2)

#   Var1 Var2 NumShared NumRows       Prc
# 1   V1   V2         2       7 0.2857143
# 2   V1   V3         2       7 0.2857143
# 3   V2   V1         2       6 0.3333333
# 4   V2   V3         0       6 0.0000000
# 5   V3   V1         2       3 0.6666667
# 6   V3   V2         0       3 0.0000000


# reshape dataset and create a Venn diagram
df_reshaped %>%
  mutate(exist = TRUE) %>%
  spread(column, exist, fill=FALSE) %>%
  select(-string) %>%
  venn()

The Venn diagram looks like: enter image description here

Obviously, the sum of the numbers shown in this diagram should be equal to the number of unique strings you get in the table df_result1. Which is 12 in our case.

Upvotes: 2

Related Questions