Canovice
Canovice

Reputation: 10441

Vectorizing a for-loop that eliminates duplicate data in dataframe R

I am working with a difficult data manipulation question in R. I am currently using a for-loop to approach the problem, however I would like to vectorize this to have it scale better. I have the following dataframe to work with:

dput(mydf)
structure(list(team_id = c(14L, 14L, 7L, 7L, 21L, 21L, 15L, 15L
), opp_team_id = c(7L, 7L, 14L, 14L, 15L, 15L, 21L, 21L), pg = c(3211L, 
3211L, 786L, 786L, 3914L, 644L, 1524L, 593L), sg = c(653L, 4122L, 
1512L, 1512L, 2593L, 10L, 54L, 54L), sf = c(4122L, 1742L, 2347L, 
2347L, 1352L, 3378L, 2843L, 1062L), pf = c(1742L, 886L, 79L, 
1134L, 687L, 1352L, 1376L, 1376L), c = c(3014L, 2604L, 2960L, 
2960L, 21L, 3216L, 1256L, 3017L), opp_pg = c(3982L, 3982L, 3211L, 
4005L, 1524L, 1524L, 3914L, 644L), opp_sg = c(786L, 2347L, 653L, 
653L, 54L, 802L, 2593L, 10L), opp_sf = c(1134L, 1134L, 4122L, 
1742L, 1062L, 1062L, 3105L, 3105L), opp_pf = c(183L, 183L, 1742L, 
886L, 3017L, 1376L, 3216L, 2135L), opp_c = c(2475L, 2960L, 3138L, 
3138L, 1256L, 3017L, 21L, 1957L)), .Names = c("team_id", "opp_team_id", 
"pg", "sg", "sf", "pf", "c", "opp_pg", "opp_sg", "opp_sf", "opp_pf", 
"opp_c"), row.names = c(NA, -8L), class = "data.frame")

mydf
  team_id opp_team_id   pg   sg   sf   pf    c opp_pg opp_sg opp_sf opp_pf opp_c
1      14           7 3211  653 4122 1742 3014   3982    786   1134    183  2475
2      14           7 3211 4122 1742  886 2604   3982   2347   1134    183  2960
3       7          14  786 1512 2347   79 2960   3211    653   4122   1742  3138
4       7          14  786 1512 2347 1134 2960   4005    653   1742    886  3138
5      21          15 3914 2593 1352  687   21   1524     54   1062   3017  1256
6      21          15  644   10 3378 1352 3216   1524    802   1062   1376  3017
7      15          21 1524   54 2843 1376 1256   3914   2593   3105   3216    21
8      15          21  593   54 1062 1376 3017    644     10   3105   2135  1957

Based on my problem at hand, rows 3-4 and 7-8 are duplicates in this dataframe. Rows 3-4 are duplicates of rows 1-2, and rows 7-8 are duplicates on rows 5-6. This is sports data, and rows 3-4 are essentially rows 1 and 2 except with the team_id and opp_team_id switched, and the same for the other 10 columns (for the most part).

Here is my for-loop for removing duplicates, which I think is quite creative, but is a for-loop nonetheless:

indices = c(1)
TFSwitch = TRUE
for(i in 2:nrow(mydf)) {
  last_row = mydf$team_id[(i-1)]
  this_row = mydf$team_id[i]

  TFSwitch = ifelse(last_row != this_row, !TFSwitch, TFSwitch)  

  if(TFSwitch == TRUE) {
    indices = c(indices, i)
  }
} 

This for-loop goes back and forth checking if the teamID column changes from row to row, and if it does, it toggles TFSwitch from TRUE to FALSE, or vice versa. It then saves the indices I want to keep in a vector.

I would like to vectorize this - any thoughts would be greatly appreciated!

Upvotes: 2

Views: 423

Answers (2)

agstudy
agstudy

Reputation: 121598

Here the same solution using data.table. My understating is that you want to remove duplicated by pairs not just finding unique indices.

library(data.table)
setDT(mydf)
mydf[,c("id1","id2"):=list(pmax(team_id,opp_team_id),pmin(team_id,opp_team_id))]
setkey(mydf,team_id,opp_team_id)[unique(mydf,by=c("id1","id2"))]

Upvotes: 1

thelatemail
thelatemail

Reputation: 93938

This is very similar to previous problems involving pairwise duplicate removal like: (pair-wise duplicate removal from dataframe). So following a similar procedure, and adding a little merge() back to get the indices, you can do:

vars <- c("team_id","opp_team_id")

mx <- do.call(pmax, mydf[vars])
mn <- do.call(pmin, mydf[vars])

merge(
  cbind(mydf[vars], ind=seq_len(nrow(mydf))),
  mydf[!duplicated(data.frame(mx,mn)), vars]
)[,"ind"]

# [1] 1 2 5 6

Upvotes: 4

Related Questions