Reputation: 1211
I have a dataframe of data with two columns lift
and skill.set
. Within skill.set
, contains a string of skills separated by commas. For example, there might be an obversation with the string "Excel, PowerPoint" and another with "PowerPoint, Excel." These two observations are capturing the same data, and I only want to keep one of them (the one with the highest lift). I can't think of an efficient way of doing this without naively turning each string into a vector with elements separated by comma, and then writing a for-loop that compares each skill.set
data to each other skill.set
data.
Example dataframe:
df = structure(list(lift = c(5.71421247789905, 4.65329289252856, 5.87820023244231,
21.1815668998877), skill.set = c("JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,Microsoft.PowerPoint",
"Microsoft.PowerPoint,Microsoft.Word,Python,SQL,Microsoft.Excel..MS.Excel.",
"Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,SQL,Microsoft.PowerPoint",
"Analytics...Text.Mining,Natural.Language.Processing,Python")), .Names = c("lift",
"skill.set"), row.names = 239:242, class = "data.frame")
Desired dataframe:
structure(list(lift = c(5.71421247789905, 5.87820023244231, 21.1815668998877
), skill.set = c("JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,Microsoft.PowerPoint",
"Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,SQL,Microsoft.PowerPoint",
"Analytics...Text.Mining,Natural.Language.Processing,Python")), .Names = c("lift",
"skill.set"), row.names = c(239L, 241L, 242L), class = "data.frame")
Upvotes: 0
Views: 44
Reputation: 50718
We can do the following:
df[!duplicated(sapply(strsplit(df$skill.set, ","), function(x)
paste0(sort(x), collapse = ","))), ]
# lift
#239 5.714212
#240 4.653293
#242 21.181567
# skill.set
#239 JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.Word,Python,Microsoft.PowerPoint
#240 Microsoft.PowerPoint,Microsoft.Word,Python,SQL,Microsoft.Excel..MS.Excel.
#242 Analytics...Text.Mining,Natural.Language.Processing,Python
Explanation: Split entries in df$skill.set
on ","
, then sort entries and concatenate; keep only non-duplicated entries.
To only retain the row with the largest lift
value we can use aggregate
:
setNames(aggregate(
lift ~ sapply(strsplit(skill.set, ","), function(x) paste0(sort(x), collapse = ",")),
df,
max), rev(names(df)))
#1 Analytics...Text.Mining,Natural.Language.Processing,Python
#2 JavaScript,Microsoft.Excel..MS.Excel.,Microsoft.PowerPoint,Microsoft.Word,Python
#3 Microsoft.Excel..MS.Excel.,Microsoft.PowerPoint,Microsoft.Word,Python,SQL
# lift
#1 21.181567
#2 5.714212
#3 5.878200
Upvotes: 2
Reputation: 3629
Here is another solution using dplyr
and tidyr
.
df %>%
separate_rows(skill.set, sep = ",") %>%
group_by(lift) %>%
arrange(skill.set) %>%
mutate(id = row_number()) %>%
spread(id, skill.set) %>%
unite(skill.set, 2:6, sep = ",") %>%
group_by(skill.set) %>%
summarise_at(vars(lift), max)
You decide which lift
you want to keep by changing max
to whatever you prefer. Also, change 2:6
based on the column numbers that were produced by the preceding spread
.
Upvotes: 2