Reputation: 9
I have some data that I've performed cluster analysis on and need to find breakpoints based on population density. The clusters overlap heavily, so I've sorted the data by population density and want to extract the last value before the 'cluster' column switches to another cluster. Basically the data looks like this:
cluster PopDens
1 5
1 7
2 8
2 9
1 10
1 12
3 14
1 16
And I would want it to return the following:
cluster PopDens
1 7
2 9
1 12
3 14
1 16
How would I go about achieving this in R?
Upvotes: 0
Views: 98
Reputation: 70336
In base R it could be done using:
x[cumsum(rle(x$cluster)$lengths),]
# cluster PopDens
#2 1 7
#4 2 9
#6 1 12
#7 3 14
#8 1 16
This also translates quite directly to data.table, in case you are interested:
library(data.table)
setDT(x)[cumsum(rle(cluster)$lengths)]
And of course we can also do it in dplyr:
library(dplyr)
slice(x, cumsum(rle(cluster)$lengths))
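To see why the rle()/cumsum() approach works, here is a small step-by-step sketch. It assumes the question's sample data is stored in a data frame called x (the construction below is my reconstruction of that data), and the intermediate names runs, last_rows and alt_rows are just illustrative:

```r
# Sample data from the question, rebuilt as a data frame (assumed layout)
x <- data.frame(
  cluster = c(1, 1, 2, 2, 1, 1, 3, 1),
  PopDens = c(5, 7, 8, 9, 10, 12, 14, 16)
)

# rle() collapses consecutive repeats into runs
runs <- rle(x$cluster)
runs$lengths  # 2 2 2 1 1
runs$values   # 1 2 1 3 1

# The cumulative sum of the run lengths is the row index where each run ends
last_rows <- cumsum(runs$lengths)  # 2 4 6 7 8
result <- x[last_rows, ]

# Equivalent without rle(): keep a row when the next cluster differs,
# and always keep the final row
alt_rows <- which(c(x$cluster[-1] != x$cluster[-nrow(x)], TRUE))
```

Both last_rows and alt_rows pick rows 2, 4, 6, 7 and 8, which is the output shown above.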
Upvotes: 3
Reputation: 83275
Another data.table solution:
library(data.table)
setDT(df)[df[, tail(.I,1), rleid(cluster)]$V1]
which gives:
   cluster PopDens
1:       1       7
2:       2       9
3:       1      12
4:       3      14
5:       1      16
Upvotes: 0
Reputation: 42602
With data.table, the rleid() function can be used for grouping:
library(data.table)
setDT(DF)[, .(PopDens = last(PopDens)), .(rleid(cluster), cluster)][, rleid := NULL][]
# cluster PopDens
#1: 1 7
#2: 2 9
#3: 1 12
#4: 3 14
#5: 1 16
There are alternative ways to achieve the same result:
DF[, .(PopDens = PopDens[.N]), .(rleid(cluster), cluster)][, rleid := NULL][]
DF[, .(PopDens = tail(PopDens, 1)), .(rleid(cluster), cluster)][, rleid := NULL][]
DF[, .SD[.N], .(rleid(cluster), cluster)][, rleid := NULL][]
DF[, tail(.SD, 1), .(rleid(cluster), cluster)][, rleid := NULL][]
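If it helps to see what rleid() is actually grouping on, the following sketch prints the run ids for the question's data. The DF construction is my reconstruction of the sample data, and the names run_id, run and res are only illustrative:

```r
library(data.table)

# Sample data from the question (assumed layout)
DF <- data.table(
  cluster = c(1, 1, 2, 2, 1, 1, 3, 1),
  PopDens = c(5, 7, 8, 9, 10, 12, 14, 16)
)

# rleid() assigns a new id every time the value changes,
# so the two separate stretches of cluster 1 get different ids
run_id <- rleid(DF$cluster)
run_id
# 1 1 2 2 3 3 4 5

# Group by the run id, keep the last row of each run, then drop the helper column
res <- DF[, .SD[.N], by = .(run = rleid(cluster))][, run := NULL][]
```

Grouping on cluster alone would merge all rows of cluster 1 into one group, which is why the run id is needed here.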
Upvotes: 0