Reputation: 133

Counting matching values in a data frame based on row name

I have a data frame with a name column and a date column (that I've converted to number of days since the earliest date). I want a new column that records how many entries the name for that column has in the previous 10 days. For example, here's a current data frame:

> head(starts[,c(5,29)], n=10)
    name            date_num
1   James           2
2   James           3
3   James           4
4   James           5
5   Julia           1
6   Julia           2
7   Julia           3
8   Julia           4
9   Jess            2
10  Jess            4

and here's what I'm looking for:

> head(starts[,c(5,29)], n=10)
    name            date_num  count
1   James           2         0
2   James           3         1
3   James           4         2
4   James           5         3
5   Julia           1         0
6   Julia           2         1
7   Julia           3         2
8   Julia           4         3
9   Jess            2         0
10  Jess            4         1

That's pretty simplified, but hopefully what I'm going for is clear. Thanks for any help! EDIT: based on the first couple answers, I don't think I was clear enough. I'm not trying to add an index column for each individual name (thought it might look similar); I want a count of the number of rows with the same name AND a date_num value between x-10 and x, where x is the date_num of the row in question. Here's a longer data frame which hopefully is a bit more informative:

> head(starts[,c(5,29)], n=10)
    name            date_num  count
1   James           1         0
2   James           2         1
3   James           3         2
4   James           4         3
5   James           5         4
6   James           7         5
7   James           8         6
8   James           9         7
9   James           10        8
10  James           11        9
11  James           12        9
12  James           13        9
13  James           15        8
14  Julia           1         0
15  Julia           2         1
16  Julia           4         2
17  Julia           19        0
18  Julia           20        1
19  Julia           22        2
20  Julia           24        3
21  Julia           31        2

Hopefully this makes more sense. The count is only of the rows with dates 10 days prior or fewer and the same name, not all preceding rows. One method I can describe but not code is: for a given row, take the date_num, subtract 10, and find the row with the same name and the smallest date_num greater than that figure. Take the difference between those two row's index numbers. That requires some sorting and indexing, but it seems doable. If anyone can describe a way to make R do that, I would be hugely grateful.

Upvotes: 1

Answers (2)

mpalanco

Reputation: 13580

Base package:

Using rle

df$count <- unlist(sapply(rle(df$name)$lengths, seq))-1

Modifying an answer by Marek to start from zero:

df$count <- ave(rep(NA, nrow(df)), df$name, FUN=seq_along)-1

sqldf

library(sqldf)
sqldf("SELECT a.name, a.date_num, COUNT(*)-1 As count
      FROM df AS a INNER JOIN df As b
      ON (a.name = b.name) AND (a.rowid >= b.rowid)
      GROUP BY a.name, a.date_num")

Output:

    name date_num count
1  James        2     0
2  James        3     1
3  James        4     2
4  James        5     3
5  Julia        1     0
6  Julia        2     1
7  Julia        3     2
8  Julia        4     3
9   Jess        2     0
10  Jess        4     1

Data:

df <- structure(list(name = c("James", "James", "James", "James", "Julia", 
     "Julia", "Julia", "Julia", "Jess", "Jess"), date_num = c(2L, 
      3L, 4L, 5L, 1L, 2L, 3L, 4L, 2L, 4L)), .Names = c("name", "date_num"), 
      class = "data.frame", row.names = c(NA, -10L))

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

You can try getanID from my "splitstackshape" package:

> library(splitstackshape)
> getanID(mydf, "name")[, .id := .id - 1][]
     name date_num .id
 1: James        2   0
 2: James        3   1
 3: James        4   2
 4: James        5   3
 5: Julia        1   0
 6: Julia        2   1
 7: Julia        3   2
 8: Julia        4   3
 9:  Jess        2   0
10:  Jess        4   1

The .id := .id - 1 is to get the counts to start from zero.

That's pretty much like:

library(data.table)
as.data.table(mydf)[, count := 0:(.N-1), by = name][]

Upvotes: 2

Counting matching values in a data frame based on row name

Answers (2)

Related Questions