Gotmadstacks

Reputation: 369

R - Remove combinations of variables that occur more than once in a data.frame

Say I have a dataframe, df, with three vectors:

  colours   individual value
1   white individual 1   0.4
2   white individual 1   0.7
3   black individual 2   1.1
4   black individual 3   0.5

Sometimes the same individual shows up multiple times with the same colour but different values. I would like to write code that deletes every row in which this happens.

***EDIT: The real data frame has millions of rows, not four; I don't think the current solutions scale to it.

I would like to count how many times the combination I am currently on in my for loop occurs, and then delete those rows from the data.frame. In the example above that means removing both rows for individual 1, leaving the other two rows.

So far my approach was this:

  1. Get a list of all the colours

  2. Get a list of all the individuals

  3. Write two for loops.

colours <- unique(df$colours)
ind <- unique(df$individual)

for (i in ind) {
  for (c in colours) {
    # Something here. Probably an if statement asking whether the individual
    # I'm on in the loop appears with the colour I'm on more than once;
    # if so, get rid of those rows.
  }
}

My expected output is this:

  colours   individual value
1   black individual 2   1.1
2   black individual 3   0.5

Source data

df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))

Upvotes: 3

Views: 4237

Answers (5)

RHertel

Reputation: 23788

You could try with anti_join() from the dplyr library:

library(dplyr)
anti_join(df, df[duplicated(df[1:2]), ], by = "individual")
#  colours   individual value
#1   black individual 3   0.5
#2   black individual 2   1.1
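
If the same individual can legitimately appear under more than one colour, joining by `"individual"` alone would drop those rows too; joining on both key columns restricts the removal to the duplicated pair. A sketch of that variant, assuming dplyr is installed:

```r
library(dplyr)

df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1",
                                "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))

# Rows whose (colours, individual) pair has already been seen
dupes <- df[duplicated(df[1:2]), ]

# Remove every row that shares a duplicated pair
res <- anti_join(df, dupes, by = c("colours", "individual"))
```

On this data the result is the same as the answer above, since individual 1 only ever appears as white.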

Upvotes: 5

Gotmadstacks

Reputation: 369

On the basis of some suggestions in the comments, this answer worked best:

df[!(duplicated(df[,1:2]) | duplicated(df[,1:2], fromLast = TRUE)), ]

This is slightly different from the suggestions in the comments: it specifies the columns rather than the rows, and so achieves the result I wanted from the question (removing every row where the individual-and-colour combination is duplicated). It is also more generally useful, since the real data has millions of rows rather than the four in the example.
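
Applied to the source data from the question, the logic can be seen step by step (a minimal base-R sketch):

```r
df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1",
                                "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))

# duplicated() marks every repeat after the first;
# duplicated(..., fromLast = TRUE) marks every repeat before the last.
# Their union therefore flags *all* rows of a repeated (colours, individual) pair.
dup_any <- duplicated(df[, 1:2]) | duplicated(df[, 1:2], fromLast = TRUE)

res <- df[!dup_any, ]
res
#   colours   individual value
# 3   black individual 2   1.1
# 4   black individual 3   0.5
```

Because neither call loops over rows, this stays fast on millions of rows.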

Upvotes: 1

Sam Firke

Reputation: 23014

A straightforward dplyr approach would be to group as desired and filter for groups with fewer than 2 observations:

library(dplyr)
df %>%
  group_by(colours, individual) %>%
  filter(n() < 2)

Source: local data frame [2 x 3]
Groups: colours, individual [2]

  colours   individual value
   (fctr)       (fctr) (dbl)
1   black individual 2   1.1
2   black individual 3   0.5

Upvotes: 2

akrun

Reputation: 887098

Here is another option using data.table

library(data.table)
setDT(df)[, if (.N == 1) .SD, .(colours, individual)]
#   colours   individual value
#1:   black individual 2   1.1
#2:   black individual 3   0.5

Upvotes: 1

Ujjwal Kumar

Reputation: 581

This should do it. I created a sample dataset and added an index vector to show that you keep only the first occurrence of each colour-user combination. This works if your rownames are the actual row numbers.

## Data preparation
colours <- sample(c("red","blue","green","yellow"), size = 50, replace = T)
users <- sample(1:10, size=50, replace=T )
df <- data.frame(colours,users)
df$value <- runif(50)
df$index <- 1:50

## Keep only the first occurrence of each colour-user combination
res <- unique(df[, 1:2])
res$value <- df$value[as.integer(rownames(res))]

Upvotes: 0
