How to delete rows in R, which contains the same "set" of data within a column

Question

I have a dataframe of data with two columns lift and skill.set. Within skill.set, contains a string of skills separated by commas. For example, there might be an obversation with the string "Excel, PowerPoint" and another with "PowerPoint, Excel." These two observations are capturing the same data, and I only want to keep one of them (the one with the highest lift). I can't think of an efficient way of doing this without naively turning each string into a vector with elements separated by comma, and then writing a for-loop that compares each skill.set data to each other skill.set data.

Example dataframe:

df = structure(list(lift = c(5.71421247789905, 4.65329289252856, 5.87820023244231, 
21.1815668998877), skill.set = c("JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,Microsoft.PowerPoint", 
"Microsoft.PowerPoint,Microsoft.Word,Python,SQL,Microsoft.Excel..MS.Excel.", 
"Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,SQL,Microsoft.PowerPoint", 
"Analytics...Text.Mining,Natural.Language.Processing,Python")), .Names = c("lift", 
"skill.set"), row.names = 239:242, class = "data.frame")

Desired dataframe:

structure(list(lift = c(5.71421247789905, 5.87820023244231, 21.1815668998877
), skill.set = c("JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,Microsoft.PowerPoint", 
"Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,SQL,Microsoft.PowerPoint", 
"Analytics...Text.Mining,Natural.Language.Processing,Python")), .Names = c("lift", 
"skill.set"), row.names = c(239L, 241L, 242L), class = "data.frame")

Maurits Evers · Accepted Answer

We can do the following:

df[!duplicated(sapply(strsplit(df$skill.set, ","), function(x) 
    paste0(sort(x), collapse = ","))), ]
#         lift
#239  5.714212
#240  4.653293
#242 21.181567
#                                                                           skill.set
#239 JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,Microsoft.PowerPoint
#240        Microsoft.PowerPoint,Microsoft.Word,Python,SQL,Microsoft.Excel..MS.Excel.
#242                       Analytics...Text.Mining,Natural.Language.Processing,Python

Explanation: Split entries in df$skill.set on ",", then sort entries and concatenate; keep only non-duplicated entries.

Update

To only retain the row with the largest lift value we can use aggregate:

setNames(aggregate(
    lift ~ sapply(strsplit(skill.set, ","), function(x) paste0(sort(x), collapse = ",")),
    df,
    max), rev(names(df)))
#1                       Analytics...Text.Mining,Natural.Language.Processing,Python
#2 JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.PowerPoint,Microsoft.Word,Python
#3        Microsoft.Excel..MS.Excel.,Microsoft.PowerPoint,Microsoft.Word,Python,SQL
#       lift
#1 21.181567
#2  5.714212
#3  5.878200

How to delete rows in R, which contains the same "set" of data within a column

Answers (2)

Update

Related Questions

How to delete rows in R, which contains the same &quot;set&quot; of data within a column

Answers (2)

Update

Related Questions

How to delete rows in R, which contains the same "set" of data within a column