Marc Tulla

Reputation: 1771

R: keeping the first observation of each time series group

A follow-up on this question (I want to keep the threads separate): I want to look at each user and the fruits they ate. But I'm only interested in the first time they eat a fruit. From there, I want to rank order the fruits eaten by time.

Some data:

set.seed(1234)
library(dplyr)

# 30 random user/fruit pairs spread over 10 consecutive days
data <- data.frame(
    user = sample(c("1234", "9876", "4567"), 30, replace = TRUE),
    fruit = sample(c("banana", "apple", "pear", "lemon"), 30, replace = TRUE),
    date = rep(seq(as.Date("2010-02-01"), length.out = 10, by = "1 day"), 3))

# Sort chronologically within each user
data <- data %>% arrange(user, date)

In this case, you can see that, for example, User 1234 ate a banana on 2010-02-01, then again on 02-03, 02-04, and 02-05.

   user  fruit       date
1  1234 banana 2010-02-01
2  1234  lemon 2010-02-02
3  1234 banana 2010-02-03
4  1234  apple 2010-02-03
5  1234  lemon 2010-02-03
6  1234 banana 2010-02-04
7  1234 banana 2010-02-05

I don't want to change anything about the relative order of fruits by time, but I do want to remove all subsequent instances of "banana" after the first one (and likewise with every other fruit).

For user 1234, in this case, I'm looking for:

   user  fruit       date
1  1234 banana 2010-02-01
2  1234  lemon 2010-02-02
4  1234  apple 2010-02-03

One way I can think of going about this is arranging the dataframe by user > fruit > date, then keeping only the first unique observation of "fruit" by the user grouping. I'm getting hung up on how exactly to do that in dplyr. Any thoughts?

Upvotes: 1

Views: 1528

Answers (2)

aaronwolen

Reputation: 3753

A dplyr solution would involve grouping by the user and fruit variables and filtering for the row with the lowest-ranked (earliest) date in each group:

data %>%
  group_by(user, fruit) %>%
  filter(row_number(date) == 1)
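As a rough check of the idea, here is a minimal sketch on a hand-typed copy of the user 1234 rows from the question. The names `eaten`, `first_eaten`, and `eat_order` are made up for illustration, and the final `arrange()`/`mutate()` steps are my assumption about how the "rank order by time" part of the question could be handled, not something the question confirms.

```r
library(dplyr)

# Toy data copied by hand from the user 1234 rows shown in the question
eaten <- data.frame(
  user  = "1234",
  fruit = c("banana", "lemon", "banana", "apple", "lemon", "banana", "banana"),
  date  = as.Date(c("2010-02-01", "2010-02-02", "2010-02-03", "2010-02-03",
                    "2010-02-03", "2010-02-04", "2010-02-05")),
  stringsAsFactors = FALSE
)

first_eaten <- eaten %>%
  group_by(user, fruit) %>%
  filter(row_number(date) == 1) %>%     # earliest row per user/fruit pair
  ungroup() %>%
  arrange(user, date) %>%               # chronological order within each user
  group_by(user) %>%
  mutate(eat_order = row_number()) %>%  # hypothetical column: 1 = first fruit eaten
  ungroup()

print(first_eaten)
```

For this toy input the result should contain banana, lemon, and apple, in that order. Because `row_number(date)` ranks by date rather than by row position, this approach should not depend on the data being sorted beforehand.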

Upvotes: 1

Pierre L

Reputation: 28441

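Here is an approach using the duplicated function. It keeps the first row for each fruit within each user, so it relies on the data already being sorted by date: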

data %>%
  group_by(user) %>%
  filter(!duplicated(fruit))
#    user  fruit       date
# 1  1234  apple 2010-02-01
# 2  1234 banana 2010-02-01
# 3  1234   pear 2010-02-03
# 4  1234  lemon 2010-02-10
# 5  4567   pear 2010-02-01
# 6  4567 banana 2010-02-05
# 7  4567  lemon 2010-02-08
# 8  9876  apple 2010-02-02
# 9  9876   pear 2010-02-02
# 10 9876  lemon 2010-02-06
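
One caveat worth sketching: `duplicated()` only looks at row position, so if the rows are not already in date order, the "first" row kept may not be the earliest one. The sketch below uses a small, deliberately unsorted toy data frame (`eaten` is a made-up name, not the question's `data`) and sorts before filtering. It also shows `distinct()` with `.keep_all = TRUE`, which I believe gives the same rows here.

```r
library(dplyr)

# Hypothetical toy data, deliberately NOT sorted by date
eaten <- data.frame(
  user  = c("1234", "1234", "1234", "1234"),
  fruit = c("banana", "lemon", "banana", "apple"),
  date  = as.Date(c("2010-02-03", "2010-02-02", "2010-02-01", "2010-02-03")),
  stringsAsFactors = FALSE
)

# Sort first so that the first occurrence of each fruit is the earliest one
first_eaten <- eaten %>%
  arrange(user, date) %>%
  group_by(user) %>%
  filter(!duplicated(fruit)) %>%
  ungroup()

# Alternative: distinct() keeps the first row per user/fruit combination
first_eaten_distinct <- eaten %>%
  arrange(user, date) %>%
  distinct(user, fruit, .keep_all = TRUE)

print(first_eaten)
print(first_eaten_distinct)
```

Without the `arrange()` call, the banana row dated 2010-02-03 would be kept instead of the one dated 2010-02-01, since it appears first in the unsorted input.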

Upvotes: 4
