Henry D
Henry D

Reputation: 133

Counting matching values in a data frame based on row name

I have a data frame with a name column and a date column (that I've converted to number of days since the earliest date). I want a new column that records how many entries the name for that column has in the previous 10 days. For example, here's a current data frame:

> head(starts[,c(5,29)], n=10)
    name            date_num
1   James           2
2   James           3
3   James           4
4   James           5
5   Julia           1
6   Julia           2
7   Julia           3
8   Julia           4
9   Jess            2
10  Jess            4

and here's what I'm looking for:

> head(starts[,c(5,29)], n=10)
    name            date_num  count
1   James           2         0
2   James           3         1
3   James           4         2
4   James           5         3
5   Julia           1         0
6   Julia           2         1
7   Julia           3         2
8   Julia           4         3
9   Jess            2         0
10  Jess            4         1

That's pretty simplified, but hopefully what I'm going for is clear. Thanks for any help! EDIT: based on the first couple answers, I don't think I was clear enough. I'm not trying to add an index column for each individual name (thought it might look similar); I want a count of the number of rows with the same name AND a date_num value between x-10 and x, where x is the date_num of the row in question. Here's a longer data frame which hopefully is a bit more informative:

> head(starts[,c(5,29)], n=10)
    name            date_num  count
1   James           1         0
2   James           2         1
3   James           3         2
4   James           4         3
5   James           5         4
6   James           7         5
7   James           8         6
8   James           9         7
9   James           10        8
10  James           11        9
11  James           12        9
12  James           13        9
13  James           15        8
14  Julia           1         0
15  Julia           2         1
16  Julia           4         2
17  Julia           19        0
18  Julia           20        1
19  Julia           22        2
20  Julia           24        3
21  Julia           31        2

Hopefully this makes more sense. The count is only of the rows with dates 10 days prior or fewer and the same name, not all preceding rows. One method I can describe but not code is: for a given row, take the date_num, subtract 10, and find the row with the same name and the smallest date_num greater than that figure. Take the difference between those two row's index numbers. That requires some sorting and indexing, but it seems doable. If anyone can describe a way to make R do that, I would be hugely grateful.

Upvotes: 1

Views: 151

Answers (2)

mpalanco
mpalanco

Reputation: 13580

Base package:

  • Using rle
  • df$count <- unlist(sapply(rle(df$name)$lengths, seq))-1
    

  • Modifying an answer by Marek to start from zero:
  • df$count <- ave(rep(NA, nrow(df)), df$name, FUN=seq_along)-1
    

    sqldf

    library(sqldf)
    sqldf("SELECT a.name, a.date_num, COUNT(*)-1 As count
          FROM df AS a INNER JOIN df As b
          ON (a.name = b.name) AND (a.rowid >= b.rowid)
          GROUP BY a.name, a.date_num")
    

    Output:

        name date_num count
    1  James        2     0
    2  James        3     1
    3  James        4     2
    4  James        5     3
    5  Julia        1     0
    6  Julia        2     1
    7  Julia        3     2
    8  Julia        4     3
    9   Jess        2     0
    10  Jess        4     1
    

    Data:

    df <- structure(list(name = c("James", "James", "James", "James", "Julia", 
         "Julia", "Julia", "Julia", "Jess", "Jess"), date_num = c(2L, 
          3L, 4L, 5L, 1L, 2L, 3L, 4L, 2L, 4L)), .Names = c("name", "date_num"), 
          class = "data.frame", row.names = c(NA, -10L))
    

    Upvotes: 0

    A5C1D2H2I1M1N2O1R2T1
    A5C1D2H2I1M1N2O1R2T1

    Reputation: 193687

    You can try getanID from my "splitstackshape" package:

    > library(splitstackshape)
    > getanID(mydf, "name")[, .id := .id - 1][]
         name date_num .id
     1: James        2   0
     2: James        3   1
     3: James        4   2
     4: James        5   3
     5: Julia        1   0
     6: Julia        2   1
     7: Julia        3   2
     8: Julia        4   3
     9:  Jess        2   0
    10:  Jess        4   1
    

    The .id := .id - 1 is to get the counts to start from zero.

    That's pretty much like:

    library(data.table)
    as.data.table(mydf)[, count := 0:(.N-1), by = name][]
    

    Upvotes: 2

    Related Questions