George
George

Reputation: 684

Identifying Patterns by Group in a Data Frame

I currently have a dataset which shows the number of people sick by location per day. For each group (Location) there is an output of 1 (sick) or 0 (not sick) in the 'Sick' column to state whether there is sickness in the area.

My aim is to create a new column that identifies periods of long term sickness and labels any row that contributes to this.

The following data frame shows the sample data:

Current <- structure(list(Location = c("Madrid", "Madrid", "Madrid", "Madrid", 
"Madrid", "Madrid", "Madrid", "Madrid", "Madrid", "Madrid", "Madrid", 
"Madrid", "Madrid", "Madrid", "Madrid"), Date = structure(c(16122, 
16123, 16124, 16125, 16126, 16127, 16128, 16129, 16130, 16131, 
16132, 16133, 16134, 16135, 16136), class = "Date"), Sick = c(1, 
1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0)), .Names = c("Location", 
"Date", "Sick"), row.names = c(NA, -15L), class = c("tbl_df", 
"tbl", "data.frame"))

The rule I would like to apply to the new column is the following:

*# If in any 6 day period there are 4 days of sickness, all days with a sickness recorded should have “Long Term” entered in the new computed variable ('Type'). If this condition is not met then “Short Term” should be entered in the new computed variable which is called 'Type'.

For this to work I need it perform the calculation by group. Location is the group in this analysis so I aim to be able to class days based on location. I have not been able to work this out.

The desired output would look like this:

Desired <- structure(list(Location = c("Madrid", "Madrid", "Madrid", "Madrid", 
    "Madrid", "Madrid", "Madrid", "Madrid", "Madrid", "Madrid", "Madrid", 
    "Madrid", "Madrid", "Madrid", "Madrid"), Date = structure(c(16122, 
    16123, 16124, 16125, 16126, 16127, 16128, 16129, 16130, 16131, 
    16132, 16133, 16134, 16135, 16136), class = "Date"), Sick = c(1, 
    1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0), Type = c(NA, NA, NA, 
    NA, NA, "Long Term", "Long Term", "Long Term", "Long Term", NA, 
    "Long Term", "Long Term", "Long Term", "Long Term", NA)), .Names = c("Location", 
    "Date", "Sick", "Type"), row.names = c(NA, -15L), class = c("tbl_df", 
    "tbl", "data.frame"))

Any help on this is much appreciated and also on any feedback regarding the question style and layout is helpful.

Currently, I have the problem of only the Window term that meets the ifelse criteria being labelled 'Long Term'. What I need is for all observations that form the window testing and have sickness as 1 to be labelled 'Long Term':

Upvotes: 0

Views: 77

Answers (2)

Mark Peterson
Mark Peterson

Reputation: 9570

One approach would be to follow @Noobie's suggestions to find the start of the "Long Term" periods, then extend the entries from there. Here, I am using rollsum from zoo instead, just because that is what I am familiar with working with. Importantly, this assumes that there is an entry for every day (if not, consider using complete with full_seq from tidyr) and that they are in order (if not, use arrange).

First, find spots that start long term periods:

temp <-
  Current %>%
  mutate(rollCount = zoo::rollsum(Sick, 6
                                  , fill = NA
                                  , align = "left")
         , startLong = rollCount >= 4)

Then, add 5 days to each of those starting points (using unique to avoid re-setting multiple times):

toFill <-
  sapply(which(temp$startLong), function(x){
    x + (0:5)
  }) %>%
  as.numeric() %>%
  unique()

Then, add a column for the Type, and set it to Long Term just for those that we pulled in toFill:

temp$Type <- NA    
temp$Type[toFill] <- "Long Term"

Returns:

   Location       Date  Sick rollCount startLong      Type
      <chr>     <date> <dbl>     <dbl>     <lgl>     <chr>
1    Madrid 2014-02-21     1         3     FALSE      <NA>
2    Madrid 2014-02-22     1         3     FALSE      <NA>
3    Madrid 2014-02-23     0         3     FALSE      <NA>
4    Madrid 2014-02-24     0         4      TRUE Long Term
5    Madrid 2014-02-25     0         4      TRUE Long Term
6    Madrid 2014-02-26     1         5      TRUE Long Term
7    Madrid 2014-02-27     1         5      TRUE Long Term
8    Madrid 2014-02-28     1         5      TRUE Long Term
9    Madrid 2014-03-01     1         5      TRUE Long Term
10   Madrid 2014-03-02     0         4      TRUE Long Term
11   Madrid 2014-03-03     1        NA        NA Long Term
12   Madrid 2014-03-04     1        NA        NA Long Term
13   Madrid 2014-03-05     1        NA        NA Long Term
14   Madrid 2014-03-06     1        NA        NA Long Term
15   Madrid 2014-03-07     0        NA        NA Long Term

If you want to use this for multiple different locations, it is probably best to wrap it in a function, then use group_by and mutate to add the column.

myFunction <- function(x){
  temp <-
    data.frame(Sick = x) %>%
    mutate(rollCount = zoo::rollsum(Sick, 6
                                    , fill = NA
                                    , align = "left")
           , startLong = rollCount >= 4)

  toFill <-
    sapply(which(temp$startLong), function(x){
      x + (0:5)
    }) %>%
    as.numeric() %>%
    unique()

  temp$Type <- NA

  temp$Type[toFill] <- "Long Term"

  return(temp$Type)
}

Then, apply it to the data. Note that after applying the function, I am using ifelse to set the value of Type to "Short Term" if there is sickness and an empty string ("") if there is not.

Current %>%
  group_by(Location) %>%
  mutate(Type = myFunction(Sick)
         , Type = ifelse(Sick == 1
                         , ifelse(!is.na(Type)
                                  , Type, "Short Term")
                         , ""))

Returns:

   Location       Date  Sick       Type
      <chr>     <date> <dbl>      <chr>
1    Madrid 2014-02-21     1 Short Term
2    Madrid 2014-02-22     1 Short Term
3    Madrid 2014-02-23     0           
4    Madrid 2014-02-24     0  Long Term
5    Madrid 2014-02-25     0  Long Term
6    Madrid 2014-02-26     1  Long Term
7    Madrid 2014-02-27     1  Long Term
8    Madrid 2014-02-28     1  Long Term
9    Madrid 2014-03-01     1  Long Term
10   Madrid 2014-03-02     0  
11   Madrid 2014-03-03     1  Long Term
12   Madrid 2014-03-04     1  Long Term
13   Madrid 2014-03-05     1  Long Term
14   Madrid 2014-03-06     1  Long Term
15   Madrid 2014-03-07     0  

and should respect the different locations (again, assumes all dates are entered and rows are sorted by date).

Upvotes: 1

ℕʘʘḆḽḘ
ℕʘʘḆḽḘ

Reputation: 19395

solution:

use rolling sums in Pandas

df['flag'] = df.sickness.rolling(window = 6).sum()
df['long_term'] = (df['flag'] > 4)

or in R

library(RcppRoll)
library(dplyr)

mutate(df, flag = RcppRoll::roll_sum(x = sickness, 2),
       long_term = if_else(flag > 4, 'Bim!', 'Boom!'))

Upvotes: 0

Related Questions