Reputation: 21430
I have the following data:
> head(bigdata)
      type text
1  neutral The week in 32 photos
2  neutral Look at me! 22 selfies of the week
3  neutral Inside rebel tunnels in Homs
4  neutral Voices from Ukraine
5  neutral Water dries up ahead of World Cup
6 positive Who's your hero? Nominate them
My duplicates will look like this (with an empty $type):
7          Who's your hero? Nominate them
8          Water dries up ahead of World Cup
I remove duplicates like this:
bigdata <- bigdata[!duplicated(bigdata$text),]
The problem is that this removes the wrong duplicate. I want to remove the row where $type is empty, not the one that has a value for $type.
How can I remove a specific duplicate in R?
Upvotes: 1
Views: 403
Reputation: 59355
Here's a solution that does not use duplicated(...).
# creates an example - you have this already...
set.seed(1) # for reproducible example
bigdata <- data.frame(type=rep(c("positive","negative"),5),
                      text=sample(letters[1:10],10),
                      stringsAsFactors=FALSE)
# add some duplicates
bigdata <- rbind(bigdata,data.frame(type="",text=bigdata$text[1:5]))
# you start here...
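# sort by text, then type, both decreasing, so "" comes last within each text
newdf <- with(bigdata,bigdata[order(text,type,decreasing=TRUE),])
# keep the first row per text; [2:3] drops the extra grouping column
result <- aggregate(newdf,by=list(text=newdf$text),head,1)[2:3]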
This sorts bigdata by text and type, in decreasing order, so that for a given text, the empty type appears after any non-empty type. Then we extract only the first row for every text.
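The same sort-then-take-first idea can also be written with duplicated() itself, which is a different route from the aggregate() call above. This is only a minimal sketch on hypothetical toy data (the data frame df and its contents are made up for illustration), and it assumes an empty type is stored as "" rather than NA.

```r
# Toy data mirroring the question: rows 3 and 4 repeat earlier texts with an empty type
df <- data.frame(type = c("neutral", "positive", "", ""),
                 text = c("Water dries up", "Who's your hero?",
                          "Who's your hero?", "Water dries up"),
                 stringsAsFactors = FALSE)

# Put rows with a non-empty type first (FALSE sorts before TRUE);
# order() leaves ties in their original order
sorted <- df[order(df$type == ""), ]

# Keep the first row for each text, which is now a typed row whenever one exists
result <- sorted[!duplicated(sorted$text), ]
```

Because the typed rows are moved to the front before duplicated() runs, the rows it flags as duplicates are the ones with an empty type.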
If your data really is "big", then a data.table solution will probably be faster.
library(data.table)
DT <- as.data.table(bigdata)
setkey(DT, text, type)
DT.result <- DT[, list(type = type[.N]), by = text]
This does basically the same thing, but since setkey sorts only in increasing order, we use type[.N] to get the last occurrence of type for every text. .N is a special variable that holds the number of rows in that group.
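To see what .N and type[.N] do per group, here is a small sketch on hypothetical toy data (the table toy is made up for illustration and assumes empty types are stored as ""). It requires the data.table package to be installed.

```r
library(data.table)

# Hypothetical toy data: "b" appears twice, once with an empty type
toy <- data.table(type = c("neutral", "positive", ""),
                  text = c("a", "b", "b"))

# Sort by text, then type, in increasing order; "" sorts before any non-empty string
setkey(toy, text, type)

# .N on its own gives the number of rows in each group
counts <- toy[, .N, by = text]

# type[.N] picks the last type within each text, i.e. the non-empty one if present
res <- toy[, list(type = type[.N]), by = text]
```

Here counts has N equal to 1 for "a" and 2 for "b", and res keeps "neutral" for "a" and "positive" for "b".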
Note that the current development version implements a function setorder(), which orders a data.table by reference and can sort in both increasing and decreasing order. So, using the devel version, it'd be:
require(data.table) # 1.9.3
setorder(DT, text, -type)
DT[, list(type = type[1L]), by = text]
Upvotes: 2
Reputation: 3188
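# TRUE where type is the empty string
foo = function(x){
x == ""
}
# drop rows whose text occurs more than once and whose type is empty
dup <- duplicated(bigdata$text) | duplicated(bigdata$text, fromLast=TRUE)
bigdata <- bigdata[!(dup & foo(bigdata$type)),]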
Upvotes: 1
Reputation: 44320
You should keep rows that are either not duplicated or not missing a type value. The duplicated function flags only the second and later occurrences of each value (check out duplicated(c(1, 1, 2))), so we need to combine that result with the result of duplicated called with fromLast=TRUE.
# treat both NA and the empty string as a missing type
bigdata <- bigdata[!(duplicated(bigdata$text) |
                     duplicated(bigdata$text, fromLast=TRUE)) |
                   !(is.na(bigdata$type) | bigdata$type == ""),]
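As a quick illustration of why both calls are needed, the sketch below runs duplicated() in each direction on a hypothetical toy vector (the names later, earlier and any_dup are made up for this example).

```r
# Hypothetical toy vector of texts; "b" occurs twice
text <- c("a", "b", "b", "c")

later   <- duplicated(text)                   # FALSE FALSE  TRUE FALSE
earlier <- duplicated(text, fromLast = TRUE)  # FALSE  TRUE FALSE FALSE
any_dup <- later | earlier                    # FALSE  TRUE  TRUE FALSE
```

Only the combined vector marks every row that belongs to a duplicated group, which is what lets the filter keep the typed copy no matter which position it is in.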
Upvotes: 1