Reputation: 10441
I am working with a difficult data manipulation question in R. I am currently using a for-loop to approach the problem, however I would like to vectorize this to have it scale better. I have the following dataframe to work with:
dput(mydf)
structure(list(team_id = c(14L, 14L, 7L, 7L, 21L, 21L, 15L, 15L
), opp_team_id = c(7L, 7L, 14L, 14L, 15L, 15L, 21L, 21L), pg = c(3211L,
3211L, 786L, 786L, 3914L, 644L, 1524L, 593L), sg = c(653L, 4122L,
1512L, 1512L, 2593L, 10L, 54L, 54L), sf = c(4122L, 1742L, 2347L,
2347L, 1352L, 3378L, 2843L, 1062L), pf = c(1742L, 886L, 79L,
1134L, 687L, 1352L, 1376L, 1376L), c = c(3014L, 2604L, 2960L,
2960L, 21L, 3216L, 1256L, 3017L), opp_pg = c(3982L, 3982L, 3211L,
4005L, 1524L, 1524L, 3914L, 644L), opp_sg = c(786L, 2347L, 653L,
653L, 54L, 802L, 2593L, 10L), opp_sf = c(1134L, 1134L, 4122L,
1742L, 1062L, 1062L, 3105L, 3105L), opp_pf = c(183L, 183L, 1742L,
886L, 3017L, 1376L, 3216L, 2135L), opp_c = c(2475L, 2960L, 3138L,
3138L, 1256L, 3017L, 21L, 1957L)), .Names = c("team_id", "opp_team_id",
"pg", "sg", "sf", "pf", "c", "opp_pg", "opp_sg", "opp_sf", "opp_pf",
"opp_c"), row.names = c(NA, -8L), class = "data.frame")
mydf
team_id opp_team_id pg sg sf pf c opp_pg opp_sg opp_sf opp_pf opp_c
1 14 7 3211 653 4122 1742 3014 3982 786 1134 183 2475
2 14 7 3211 4122 1742 886 2604 3982 2347 1134 183 2960
3 7 14 786 1512 2347 79 2960 3211 653 4122 1742 3138
4 7 14 786 1512 2347 1134 2960 4005 653 1742 886 3138
5 21 15 3914 2593 1352 687 21 1524 54 1062 3017 1256
6 21 15 644 10 3378 1352 3216 1524 802 1062 1376 3017
7 15 21 1524 54 2843 1376 1256 3914 2593 3105 3216 21
8 15 21 593 54 1062 1376 3017 644 10 3105 2135 1957
Based on my problem at hand, rows 3-4 and 7-8 are duplicates in this dataframe. Rows 3-4 are duplicates of rows 1-2, and rows 7-8 are duplicates on rows 5-6. This is sports data, and rows 3-4 are essentially rows 1 and 2 except with the team_id and opp_team_id switched, and the same for the other 10 columns (for the most part).
Here is my for-loop for removing duplicates, which I think is quite creative, but is a for-loop nonetheless:
indices = c(1)
TFSwitch = TRUE
for(i in 2:nrow(mydf)) {
last_row = mydf$team_id[(i-1)]
this_row = mydf$team_id[i]
TFSwitch = ifelse(last_row != this_row, !TFSwitch, TFSwitch)
if(TFSwitch == TRUE) {
indices = c(indices, i)
}
}
This for-loop goes back and forth checking if the teamID column changes from row to row, and if it does, it toggles TFSwitch from TRUE to FALSE, or vice versa. It then saves the indices I want to keep in a vector.
I would like to vectorize this - any thoughts would be greatly appreciated!
Upvotes: 2
Views: 423
Reputation: 121598
Here the same solution using data.table
. My understating is that you want to remove duplicated by pairs not just finding unique indices.
library(data.table)
setDT(mydf)
mydf[,c("id1","id2"):=list(pmax(team_id,opp_team_id),pmin(team_id,opp_team_id))]
setkey(mydf,team_id,opp_team_id)[unique(mydf,by=c("id1","id2"))]
Upvotes: 1
Reputation: 93938
This is very similar to previous problems involving pairwise duplicate removal like: (pair-wise duplicate removal from dataframe). So following a similar procedure, and adding a little merge()
back to get the indices, you can do:
vars <- c("team_id","opp_team_id")
mx <- do.call(pmax, mydf[vars])
mn <- do.call(pmin, mydf[vars])
merge(
cbind(mydf[vars], ind=seq_len(nrow(mydf))),
mydf[!duplicated(data.frame(mx,mn)), vars]
)[,"ind"]
# [1] 1 2 5 6
Upvotes: 4