How to find similar strings within a data frame

Question

My data looks like this

df<- structure(list(A = structure(c(7L, 6L, 5L, 4L, 3L, 2L, 1L, 1L, 
1L), .Label = c("", "P42356;Q8N8J0;A4QPH2", "P67809;Q9Y2T7", 
"Q08554", "Q13835", "Q5T749", "Q9NZT1"), class = "factor"), B = structure(c(9L, 
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("P62861", "P62906", 
"P62979;P0CG47;P0CG48", "P63241;Q6IS14", "Q02413", "Q07955", 
"Q08554", "Q5T749", "Q9UQ80"), class = "factor"), C = structure(c(9L, 
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("", "P62807;O60814;P57053;Q99879;Q99877;Q93079;Q5QNW6;P58876", 
"P63241;Q6IS14", "Q02413", "Q16658", "Q5T750", "Q6P1N9", "Q99497", 
"Q9UQ80"), class = "factor")), .Names = c("A", "B", "C"), class = "data.frame", row.names = c(NA, 
-9L))

I want to count how many elements are in each column, including those that are separated by a ";". For example, in this case

the first column has 9 elements, the second column has 12 elements, and the third column has 16 elements. Then I want to check how many times an element is repeated in other columns. For example:

string      number of times       columns
Q5T749           2              1,2

Then I want to remove the strings that are seen more than once from the df.

Ista · Accepted Answer

One way to approach this is to start by re-organizing the data into a form that is more convenient to work with. The tidyr and dplyr packages are useful for that sort of thing.

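library(tidyr)
# Keep a row index so the original rows can be rebuilt later
df$index <- 1:nrow(df)
# Long format: one row per (index, column) pair
df <- gather(df, key = 'variable', value = 'value', -index, na.rm = TRUE)
# Split each cell on ";" into x1..xN, where N is 1 + the largest number of ";" in any cell
df <- separate(df, "value", into = paste("x", 1:(1 + max(nchar(gsub("[^;]", "", df$value)))), sep = ""), sep = ";", fill = "right")
# Long format again: one row per individual element
df <- gather(df, "which", "value", -index, -variable)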
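As a cross-check on the per-column counts, here is a minimal base R sketch of the same idea: split every cell on ";" and count the pieces. It is not part of the original answer. The data frame `toy` is a hypothetical, cut-down stand-in for the question's data, and it assumes character columns with "" marking empty cells.

```r
# Hypothetical toy data (assumption: character columns, "" for empty cells)
toy <- data.frame(
  A = c("Q5T749", "P67809;Q9Y2T7", ""),
  B = c("Q5T749", "Q08554", "P63241;Q6IS14"),
  C = c("Q5T750", "P63241;Q6IS14", ""),
  stringsAsFactors = FALSE
)

# Split every cell on ";" and drop missing or empty pieces,
# giving one character vector of elements per column
elements <- lapply(toy, function(col) {
  parts <- unlist(strsplit(as.character(col), ";", fixed = TRUE))
  parts[!is.na(parts) & nzchar(parts)]
})

# Number of individual elements in each column
counts <- lengths(elements)
print(counts)
```

Applied to the question's full data, the same steps should reproduce the 9, 12, and 16 quoted there, provided the empty cells are dropped as above.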

Once you do that, counting each element is easy:

addmargins(t(table(df[, c("variable", "value")])), margin = 2)
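The question also asks for a summary listing each repeated string, how many columns it appears in, and which ones. The following is one possible base R sketch of that summary, not the answer's own method. `toy`, `long`, and `shared` are hypothetical names, and columns are reported by name rather than by position.

```r
# Hypothetical toy data (assumption: character columns, "" for empty cells)
toy <- data.frame(
  A = c("Q5T749", "P67809;Q9Y2T7", ""),
  B = c("Q5T749", "Q08554", "P63241;Q6IS14"),
  C = c("Q5T750", "P63241;Q6IS14", ""),
  stringsAsFactors = FALSE
)

# One row per (column, element) pair
long <- do.call(rbind, lapply(names(toy), function(nm) {
  parts <- unlist(strsplit(toy[[nm]], ";", fixed = TRUE))
  parts <- parts[nzchar(parts)]
  data.frame(variable = rep(nm, length(parts)), value = parts,
             stringsAsFactors = FALSE)
}))

# Count each element at most once per column
long <- unique(long)

# Number of columns containing each element, and which columns those are
n_cols <- table(long$value)
cols <- tapply(long$variable, long$value, paste, collapse = ",")

shared <- data.frame(
  string  = names(n_cols),
  times   = as.integer(n_cols),
  columns = unname(cols[names(n_cols)]),
  stringsAsFactors = FALSE
)

# Keep only the elements found in more than one column
shared <- shared[shared$times > 1, ]
print(shared)
```

Calling `unique()` before counting means an element repeated within a single column is not reported as shared; drop that line if within-column repeats should count too.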

Dropping duplicates is also easy.

df <- df[!duplicated(df$value), ]
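Note that `!duplicated()` keeps the first occurrence of each repeated value and drops only the later copies. If the goal is to remove every string that appears more than once, one option is to combine it with `fromLast = TRUE`. A small sketch on a made-up vector (the values are illustrative only):

```r
# Illustrative vector with two repeated elements
value <- c("Q5T749", "Q08554", "Q5T749", "Q13835", "Q08554")

# Keeps the first copy of each repeated element
keep_first <- value[!duplicated(value)]

# Drops every element that occurs more than once, including its first copy
drop_all <- value[!(duplicated(value) | duplicated(value, fromLast = TRUE))]

print(keep_first)
print(drop_all)
```

Which variant is appropriate depends on whether "remove the strings seen more than once" means deduplicating them or discarding them entirely.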

If you really want to put the data back into the original form, you can (though I don't recommend it).

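library(dplyr)
# Wide format again: one column per original variable
df <- spread(df, key = "variable", value = "value")
# Paste the remaining pieces of each original row back together with ";"
summarize(group_by(df, index), 
          A = paste(na.omit(A), collapse = ";"), 
          B = paste(na.omit(B), collapse = ";"), 
          C = paste(na.omit(C), collapse = ";"))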
