Reputation: 133
I have a data frame with a name column and a date column (that I've converted to number of days since the earliest date). I want a new column that records how many entries the name for that column has in the previous 10 days. For example, here's a current data frame:
> head(starts[,c(5,29)], n=10)
name date_num
1 James 2
2 James 3
3 James 4
4 James 5
5 Julia 1
6 Julia 2
7 Julia 3
8 Julia 4
9 Jess 2
10 Jess 4
and here's what I'm looking for:
> head(starts[,c(5,29)], n=10)
name date_num count
1 James 2 0
2 James 3 1
3 James 4 2
4 James 5 3
5 Julia 1 0
6 Julia 2 1
7 Julia 3 2
8 Julia 4 3
9 Jess 2 0
10 Jess 4 1
That's pretty simplified, but hopefully what I'm going for is clear. Thanks for any help! EDIT: based on the first couple answers, I don't think I was clear enough. I'm not trying to add an index column for each individual name (thought it might look similar); I want a count of the number of rows with the same name AND a date_num value between x-10 and x, where x is the date_num of the row in question. Here's a longer data frame which hopefully is a bit more informative:
> head(starts[,c(5,29)], n=10)
name date_num count
1 James 1 0
2 James 2 1
3 James 3 2
4 James 4 3
5 James 5 4
6 James 7 5
7 James 8 6
8 James 9 7
9 James 10 8
10 James 11 9
11 James 12 9
12 James 13 9
13 James 15 8
14 Julia 1 0
15 Julia 2 1
16 Julia 4 2
17 Julia 19 0
18 Julia 20 1
19 Julia 22 2
20 Julia 24 3
21 Julia 31 2
Hopefully this makes more sense. The count is only of the rows with dates 10 days prior or fewer and the same name, not all preceding rows. One method I can describe but not code is: for a given row, take the date_num, subtract 10, and find the row with the same name and the smallest date_num greater than that figure. Take the difference between those two row's index numbers. That requires some sorting and indexing, but it seems doable. If anyone can describe a way to make R do that, I would be hugely grateful.
Upvotes: 1
Views: 151
Reputation: 13580
Base package:
rle
df$count <- unlist(sapply(rle(df$name)$lengths, seq))-1
df$count <- ave(rep(NA, nrow(df)), df$name, FUN=seq_along)-1
sqldf
library(sqldf)
sqldf("SELECT a.name, a.date_num, COUNT(*)-1 As count
FROM df AS a INNER JOIN df As b
ON (a.name = b.name) AND (a.rowid >= b.rowid)
GROUP BY a.name, a.date_num")
Output:
name date_num count
1 James 2 0
2 James 3 1
3 James 4 2
4 James 5 3
5 Julia 1 0
6 Julia 2 1
7 Julia 3 2
8 Julia 4 3
9 Jess 2 0
10 Jess 4 1
Data:
df <- structure(list(name = c("James", "James", "James", "James", "Julia",
"Julia", "Julia", "Julia", "Jess", "Jess"), date_num = c(2L,
3L, 4L, 5L, 1L, 2L, 3L, 4L, 2L, 4L)), .Names = c("name", "date_num"),
class = "data.frame", row.names = c(NA, -10L))
Upvotes: 0
Reputation: 193687
You can try getanID
from my "splitstackshape" package:
> library(splitstackshape)
> getanID(mydf, "name")[, .id := .id - 1][]
name date_num .id
1: James 2 0
2: James 3 1
3: James 4 2
4: James 5 3
5: Julia 1 0
6: Julia 2 1
7: Julia 3 2
8: Julia 4 3
9: Jess 2 0
10: Jess 4 1
The .id := .id - 1
is to get the counts to start from zero.
That's pretty much like:
library(data.table)
as.data.table(mydf)[, count := 0:(.N-1), by = name][]
Upvotes: 2