How to count and remove similar strings across columns

Question

I have a data frame with many columns. For example, here is one with three columns:

df<-structure(list(V1 = structure(c(5L, 1L, 7L, 3L, 2L, 4L, 6L, 6L
), .Label = c("CPSIAAAIAAVNALHGR", "DLNYCFSGMSDHR", "FPEHELIVDPQR", 
"IADPDAVKPDDWDEDAPSK", "LWADHGVQACFGR", "WGEAGAEYVVESTGVFTTMEK", 
"YYVTIIDAPGHR"), class = "factor"), V2 = structure(c(5L, 2L, 
7L, 3L, 4L, 6L, 1L, 1L), .Label = c("", "CPSIAAAIAAVNALHGR", 
"GCITIIGGGDTATCCAK", "HVGPGVLSMANAGPNTNGSQFFICTIK", "LLELGPKPEVAQQTR", 
"MVCCSAWSEDHPICNLFTCGFDR", "YYVTIIDAPGHR"), class = "factor"), 
    V3 = structure(c(4L, 3L, 2L, 4L, 3L, 1L, 1L, 1L), .Label = c("", 
    "AVCMLSNTTAIAEAWAR", "DLNYCFSGMSDHR", "FPEHELIVDPQR"), class = "factor")), .Names = c("V1", 
"V2", "V3"), class = "data.frame", row.names = c(NA, -8L))

For the first column, we ignore every other column: we just count the strings and keep the unique ones.

For the second column, we keep the unique strings and also remove any that already appeared in the first column.
For the third column, we keep the unique strings and remove any that appeared in the first or second column.

This continues for as many columns as there are.

For example, for this data the result would be the following:

 Column 1              Column 2                    Column 3
LWADHGVQACFGR
CPSIAAAIAAVNALHGR     LLELGPKPEVAQQTR              AVCMLSNTTAIAEAWAR
YYVTIIDAPGHR          GCITIIGGGDTATCCAK 
FPEHELIVDPQR          HVGPGVLSMANAGPNTNGSQFFICTIK   
DLNYCFSGMSDHR         MVCCSAWSEDHPICNLFTCGFDR   
IADPDAVKPDDWDEDAPSK     
WGEAGAEYVVESTGVFTTMEK

Sotos · Accepted Answer

Here is a solution using the tidyverse:

library(tidyverse)

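df1 <- df %>% 
  gather(var, string) %>%                         # long format: columns stacked in order V1, V2, ...
  filter(string != '' & !duplicated(string)) %>%  # drop empty cells and strings already seen earlier
  group_by(var) %>% 
  mutate(cnt = seq(n())) %>%                      # row index within each original column
  spread(var, string) %>%                         # back to wide format, one column per variable
  select(-cnt)                                    # drop the helper index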

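Which gives the following (the printout still shows the helper column cnt, so it appears to have been taken before the final select(-cnt)):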

# A tibble: 7 x 4
    cnt                    V1                          V2                V3
* <int>                 <chr>                       <chr>             <chr>
1     1         LWADHGVQACFGR             LLELGPKPEVAQQTR AVCMLSNTTAIAEAWAR
2     2     CPSIAAAIAAVNALHGR           GCITIIGGGDTATCCAK              <NA>
3     3          YYVTIIDAPGHR HVGPGVLSMANAGPNTNGSQFFICTIK              <NA>
4     4          FPEHELIVDPQR     MVCCSAWSEDHPICNLFTCGFDR              <NA>
5     5         DLNYCFSGMSDHR                        <NA>              <NA>
6     6   IADPDAVKPDDWDEDAPSK                        <NA>              <NA>
7     7 WGEAGAEYVVESTGVFTTMEK                        <NA>              <NA>

You can use colSums to count the strings kept in each column:

colSums(!is.na(df1))
#V1 V2 V3 
# 7  4  1
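
Since gather and spread are superseded in current tidyr, the same idea can probably be written with pivot_longer and pivot_wider. The following is only a sketch, not part of the original answer: it assumes tidyr 1.3.0 or later (for the cols_vary argument, which stacks the columns one after another the way gather does) and uses a small made-up data frame in place of the question's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the question's df
df <- data.frame(
  V1 = c("a", "b", "c", "c"),
  V2 = c("d", "b", "",  ""),
  V3 = c("a", "e", "e", ""),
  stringsAsFactors = FALSE
)

df1 <- df %>%
  # cols_vary = "slowest" keeps all of V1 first, then V2, then V3,
  # so duplicated() favours the leftmost column
  pivot_longer(everything(), names_to = "var", values_to = "string",
               cols_vary = "slowest") %>%
  filter(string != "", !duplicated(string)) %>%   # drop empties and repeats
  group_by(var) %>%
  mutate(cnt = row_number()) %>%                  # row index within each column
  ungroup() %>%
  pivot_wider(names_from = var, values_from = string) %>%
  select(-cnt)                                    # drop the helper index

colSums(!is.na(df1))
# V1 V2 V3 
#  3  1  1
```

Without cols_vary = "slowest", pivot_longer interleaves the columns row by row, and a string could then be credited to a later column than the one it should belong to.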

A similar approach in base R, which stores the strings in a list, would be:

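# convert the factor columns to character (this overwrites df)
df[] <- lapply(df, as.character)
# long format: 'values' holds the strings, 'ind' the source column
d1 <- stack(df)
# drop empty cells and strings already seen in an earlier position
d1 <- d1[d1$values != '' & !duplicated(d1$values),]
# split back by column; the unequal lengths make this a list
l1 <- unstack(d1, values ~ ind)

lengths(l1)
#V1 V2 V3 
# 7  4  1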
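
The rule from the question (keep each column's unique strings, minus anything claimed by an earlier column) can also be spelled out as an explicit loop, which may be easier to follow or adapt. This is a minimal sketch on made-up data; the names seen, kept and padded are illustrative and not from the original answer.

```r
# Hypothetical toy data standing in for the question's df
df <- data.frame(
  V1 = c("a", "b", "c", "c"),
  V2 = c("d", "b", "",  ""),
  V3 = c("a", "e", "e", ""),
  stringsAsFactors = FALSE
)

seen <- character(0)                     # strings claimed by earlier columns
kept <- vector("list", ncol(df))
names(kept) <- names(df)

for (nm in names(df)) {
  vals <- unique(as.character(df[[nm]]))
  vals <- vals[!is.na(vals) & vals != ""]  # ignore empty cells
  kept[[nm]] <- setdiff(vals, seen)        # only strings not seen so far
  seen <- c(seen, kept[[nm]])
}

lengths(kept)
# V1 V2 V3 
#  3  1  1

# Optional: pad with NA to get a rectangular data frame again
padded <- as.data.frame(
  sapply(kept, `length<-`, max(lengths(kept))),
  stringsAsFactors = FALSE
)
```

The loop walks the columns left to right, so the leftmost column always wins when a string occurs in more than one column, which matches the order dependence of the stack and duplicated approach.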
