Reputation: 369
Say I have a data frame, df, with three columns:
  colours   individual value
1   white individual 1   0.4
2   white individual 1   0.7
3   black individual 2   1.1
4   black individual 3   0.5
Sometimes the same person shows up multiple times for the same colour but different values. I would like to write some code that would delete all of the instances in which this happens.
***EDIT: There are many more rows than 4 - millions - I don't think the current solutions work.
I would like to count how many times the individual/colour combination I am currently on in my for loop comes up and, if it comes up more than once, delete all of its rows from the data.frame. So in the example above, I would like to get rid of individual 1, which would leave the other two rows.
So far my approach was this:
Get a list of all the colours
Get a list of all the individuals
Write two for loops.
colours <- unique(df$colours)
ind <- unique(df$individual)
for (i in ind)
{
  for (c in colours)
  {
    # something here. Probably an if, asking whether the person I'm on in the loop
    # is found with the colour I am on more than once; if so, get rid of them
  }
}
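For reference, the counting idea described above can probably be done without the two loops. This is only a sketch, assuming base R's ave() is acceptable; the names n_per_pair and result are made up for illustration and are not part of my original attempt:

```r
df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1",
                                "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))

# For every row, count how many rows share its colour/individual pair
n_per_pair <- ave(seq_len(nrow(df)), df$colours, df$individual, FUN = length)

# Keep only the rows whose pair occurs exactly once
result <- df[n_per_pair == 1, ]
result
```

Because the count is computed for all rows in one call, there is no need to visit each individual and colour in turn.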
My expected output is this:
colours individual   value
black   individual 2 1.1
black   individual 3 0.5
Source data
df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))
Upvotes: 3
Views: 4237
Reputation: 23788
You could try anti_join() from the dplyr library (df1 here is the data frame from the question):
library(dplyr)
anti_join(df1, df1[duplicated(df1[1:2]),], by="individual")
# colours individual value
#1 black individual 3 0.5
#2 black individual 2 1.1
Upvotes: 5
Reputation: 369
On the basis of some suggestions in the comments, this answer worked best:
df[!(duplicated(df[,1:2]) | duplicated(df[,1:2], fromLast = TRUE)), ]
This is slightly different from the suggestions in the comments: it selects the colour and individual columns rather than rows, so it achieves the result I wanted from the question (remove every row whose individual and colour combination occurs more than once). It is also more useful generally, because the real data has millions of rows rather than the four in the example.
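To make it clearer why both duplicated() calls are needed, here is a small sketch on the example data from the question. The intermediate names key, later, earlier and keep are only for illustration:

```r
df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1",
                                "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))

key <- df[, c("colours", "individual")]

# Marks the second and later occurrences of each pair
later <- duplicated(key)                    # FALSE  TRUE FALSE FALSE

# Scanning from the bottom marks the first and earlier occurrences instead
earlier <- duplicated(key, fromLast = TRUE) #  TRUE FALSE FALSE FALSE

# A row is kept only if neither scan flagged it
keep <- !(later | earlier)
df[keep, ]
```

Using duplicated() on its own would leave the first row of individual 1 in place, which is why the fromLast = TRUE pass is combined with it.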
Upvotes: 1
Reputation: 23014
A straightforward dplyr approach would be to group as desired and filter for groups with fewer than 2 observations:
library(dplyr)
df %>%
group_by(colours, individual) %>%
filter(n() < 2)
Source: local data frame [2 x 3]
Groups: colours, individual [2]

  colours   individual value
   (fctr)       (fctr) (dbl)
1   black individual 2   1.1
2   black individual 3   0.5
Upvotes: 2
Reputation: 887098
Here is another option using data.table:
library(data.table)
setDT(df1)[, if(.N==1) .SD , .(colours, individual)]
# colours individual value
#1: black individual 2 1.1
#2: black individual 3 0.5
Upvotes: 1
Reputation: 581
This should do it. I created a sample dataset and added an index vector to show that only the first occurrence of each colour-user pair is kept. This works if your row names are the actual row numbers.
## Data preparation
colours <- sample(c("red", "blue", "green", "yellow"), size = 50, replace = TRUE)
users <- sample(1:10, size = 50, replace = TRUE)
df <- data.frame(colours, users)
df$value <- runif(50)
df$index <- 1:50
## Keep only the first occurrence of each colour-user pair
res <- unique(df[, 1:2])
## Look up the value of each kept row via its row name
res$values <- df$value[as.integer(rownames(res))]
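A possibly simpler way to get the same rows, sketched here under the same sample-data assumptions (set.seed() and the name first_rows are added only for illustration), is to index with !duplicated(), which keeps all columns and does not depend on the row names:

```r
set.seed(1)  # only so the sketch is reproducible
colours <- sample(c("red", "blue", "green", "yellow"), size = 50, replace = TRUE)
users <- sample(1:10, size = 50, replace = TRUE)
df <- data.frame(colours, users)
df$value <- runif(50)
df$index <- 1:50

# Keep the first row of every colour-user pair, with all of its columns
first_rows <- df[!duplicated(df[, 1:2]), ]
head(first_rows)
```

Note that, like the code above, this keeps one row per pair; it does not drop every row of a repeated pair, which is what the question asks for.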
Upvotes: 0