erc
erc

Reputation: 10131

Using dplyr with filter, group_by & tail?

Here's an example df:

df <- structure(list(x = 1:30, y = 101:130, g = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("x", "y", "g"), row.names = c(NA, -30L), class = "data.frame")

I would like to get the 10 lowest values of y for each group within the filtered data.

But

df2 <- df %>% filter(x>3) %>% group_by(g) %>%  tail(y, n=10)

only returns the rows for the last group (C in this case):

Source: local data frame [10 x 3]
Groups: g

    x   y g
18 21 121 C
19 22 122 C
20 23 123 C
21 24 124 C
22 25 125 C
23 26 126 C
24 27 127 C
25 28 128 C
26 29 129 C
27 30 130 C

What am I doing wrong?

Upvotes: 14

Views: 10677

Answers (6)

talat
talat

Reputation: 70316

Here are two other options:

df %>% filter(x>3) %>% group_by(g) %>% top_n(3, desc(y))

Here we make use of top_n but use desc(y) since we want the lowest y values instead of the largest ("top") y values.

df %>% filter(x>3) %>% group_by(g) %>% arrange(y) %>% filter(1:n() <= 10)

which is equal to

df %>% filter(x>3) %>% group_by(g) %>% arrange(y) %>% slice(1:10)

After the grouping, we sort each group by increasing y and then select the first 10 rows per group (or less if there are not 10 rows in a group).

Since there was some confusion about lowest and last values to be selected: this answer selects the lowest values, not the last entries.

Upvotes: 7

eddi
eddi

Reputation: 49448

More compact syntax via data.table:

library(data.table)
dt = as.data.table(df)

# original tail question
dt[x > 3, tail(y, 10), by = g]

# 10 smallest values of y
# many options for this, here's one:
dt[x > 3, head(sort(y), 10), by = g]

# here's another, trying to take advantage of setkey speed
setkey(dt, g, y)
dt[x > 3, head(y, 10), by = g]

Upvotes: 2

aosmith
aosmith

Reputation: 36084

You can use tail inside do.

df2 <- df %>% filter(x>3) %>% group_by(g) %>%  do(tail(., n=10))

The use of . is key for this to work. From the do help page: "You can use . to refer to the current group."

Edit:

As @beginneR pointed out, I was focusing on how to use tail in groups with dplyr and missed the part of the question where the OP asked for the 10 lowest values of y. To do this correctly would take the addition of arrange. With tail, this would mean arranging by descending order of y.

df2 <- df %>% filter(x>3) %>% group_by(g) %>%  arrange(desc(y)) %>% do(tail(., n=10))

Upvotes: 25

AndrewMacDonald
AndrewMacDonald

Reputation: 2950

moar answers! This is a good job for one of the "window functions" that calculates ranks.

df %>%
  filter(x > 3) %>%
  group_by(g) %>%
  filter(y %>% min_rank <= 10)

Upvotes: 2

Spacedman
Spacedman

Reputation: 94237

Why should tail work with a grouped data frame? It doesn't know about groups.

Add an order column, select by that column, drop that column. Here I use 3 instead of 10 for compactness (and %.% because old version of dplyr):

> df %.% filter(x>3) %.% group_by(g) %.% mutate(i=order(y)) %.% filter(i <= 3) %.% select(-matches("i"))
Source: local data frame [9 x 3]
Groups: g

   x   y g
1  4 104 A
2  5 105 A
3  6 106 A
4 11 111 B
5 12 112 B
6 13 113 B
7 21 121 C
8 22 122 C
9 23 123 C

Upvotes: 1

joran
joran

Reputation: 173677

I don't know why tail isn't working, but you might try this:

df %>% 
 filter(x > 3) %>% 
 group_by(g) %>% 
 filter(.,rank(desc(y),ties.method = "min") <= 10)

Upvotes: 2

Related Questions