Kevin Sun
Kevin Sun

Reputation: 1211

How to delete rows in R, which contains the same "set" of data within a column

I have a dataframe of data with two columns lift and skill.set. Within skill.set, contains a string of skills separated by commas. For example, there might be an obversation with the string "Excel, PowerPoint" and another with "PowerPoint, Excel." These two observations are capturing the same data, and I only want to keep one of them (the one with the highest lift). I can't think of an efficient way of doing this without naively turning each string into a vector with elements separated by comma, and then writing a for-loop that compares each skill.set data to each other skill.set data.

Example dataframe:

df = structure(list(lift = c(5.71421247789905, 4.65329289252856, 5.87820023244231, 
21.1815668998877), skill.set = c("JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,Microsoft.PowerPoint", 
"Microsoft.PowerPoint,Microsoft.Word,Python,SQL,Microsoft.Excel..MS.Excel.", 
"Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,SQL,Microsoft.PowerPoint", 
"Analytics...Text.Mining,Natural.Language.Processing,Python")), .Names = c("lift", 
"skill.set"), row.names = 239:242, class = "data.frame")

Desired dataframe:

structure(list(lift = c(5.71421247789905, 5.87820023244231, 21.1815668998877
), skill.set = c("JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,Microsoft.PowerPoint", 
"Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,SQL,Microsoft.PowerPoint", 
"Analytics...Text.Mining,Natural.Language.Processing,Python")), .Names = c("lift", 
"skill.set"), row.names = c(239L, 241L, 242L), class = "data.frame")

Upvotes: 0

Views: 44

Answers (2)

Maurits Evers
Maurits Evers

Reputation: 50718

We can do the following:

df[!duplicated(sapply(strsplit(df$skill.set, ","), function(x) 
    paste0(sort(x), collapse = ","))), ]
#         lift
#239  5.714212
#240  4.653293
#242 21.181567
#                                                                           skill.set
#239 JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,Microsoft.PowerPoint
#240        Microsoft.PowerPoint,Microsoft.Word,Python,SQL,Microsoft.Excel..MS.Excel.
#242                       Analytics...Text.Mining,Natural.Language.Processing,Python

Explanation: Split entries in df$skill.set on ",", then sort entries and concatenate; keep only non-duplicated entries.


Update

To only retain the row with the largest lift value we can use aggregate:

setNames(aggregate(
    lift ~ sapply(strsplit(skill.set, ","), function(x) paste0(sort(x), collapse = ",")),
    df,
    max), rev(names(df)))
#1                       Analytics...Text.Mining,Natural.Language.Processing,Python
#2 JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.PowerPoint,Microsoft.Word,Python
#3        Microsoft.Excel..MS.Excel.,Microsoft.PowerPoint,Microsoft.Word,Python,SQL
#       lift
#1 21.181567
#2  5.714212
#3  5.878200

Upvotes: 2

hpesoj626
hpesoj626

Reputation: 3629

Here is another solution using dplyr and tidyr.

df %>%
  separate_rows(skill.set, sep = ",") %>%
  group_by(lift) %>%
  arrange(skill.set) %>%
  mutate(id = row_number()) %>%
  spread(id, skill.set) %>%
  unite(skill.set, 2:6, sep = ",") %>%
  group_by(skill.set) %>%
  summarise_at(vars(lift), max)

You decide which lift you want to keep by changing max to whatever you prefer. Also, change 2:6 based on the column numbers that were produced by the preceding spread.

Upvotes: 2

Related Questions