Reputation: 684
I currently have a dataset which shows the number of people sick by location per day. For each group (Location) there is an output of 1 (sick) or 0 (not sick) in the 'Sick' column to state whether there is sickness in the area.
My aim is to create a new column that identifies periods of long term sickness and labels any row that contributes to this.
The following data frame shows the sample data:
Current <- structure(list(Location = c("Madrid", "Madrid", "Madrid", "Madrid",
"Madrid", "Madrid", "Madrid", "Madrid", "Madrid", "Madrid", "Madrid",
"Madrid", "Madrid", "Madrid", "Madrid"), Date = structure(c(16122,
16123, 16124, 16125, 16126, 16127, 16128, 16129, 16130, 16131,
16132, 16133, 16134, 16135, 16136), class = "Date"), Sick = c(1,
1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0)), .Names = c("Location",
"Date", "Sick"), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
The rule I would like to apply to the new column is the following:
*# If in any 6 day period there are 4 days of sickness, all days with a sickness recorded should have “Long Term” entered in the new computed variable ('Type'). If this condition is not met then “Short Term” should be entered in the new computed variable which is called 'Type'.
For this to work I need it perform the calculation by group. Location is the group in this analysis so I aim to be able to class days based on location. I have not been able to work this out.
The desired output would look like this:
Desired <- structure(list(Location = c("Madrid", "Madrid", "Madrid", "Madrid",
"Madrid", "Madrid", "Madrid", "Madrid", "Madrid", "Madrid", "Madrid",
"Madrid", "Madrid", "Madrid", "Madrid"), Date = structure(c(16122,
16123, 16124, 16125, 16126, 16127, 16128, 16129, 16130, 16131,
16132, 16133, 16134, 16135, 16136), class = "Date"), Sick = c(1,
1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0), Type = c(NA, NA, NA,
NA, NA, "Long Term", "Long Term", "Long Term", "Long Term", NA,
"Long Term", "Long Term", "Long Term", "Long Term", NA)), .Names = c("Location",
"Date", "Sick", "Type"), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
Any help on this is much appreciated and also on any feedback regarding the question style and layout is helpful.
Currently, I have the problem of only the Window term that meets the ifelse criteria being labelled 'Long Term'. What I need is for all observations that form the window testing and have sickness as 1 to be labelled 'Long Term':
Upvotes: 0
Views: 77
Reputation: 9570
One approach would be to follow @Noobie's suggestions to find the start of the "Long Term" periods, then extend the entries from there. Here, I am using rollsum
from zoo
instead, just because that is what I am familiar with working with. Importantly, this assumes that there is an entry for every day (if not, consider using complete
with full_seq
from tidyr
) and that they are in order (if not, use arrange
).
First, find spots that start long term periods:
temp <-
Current %>%
mutate(rollCount = zoo::rollsum(Sick, 6
, fill = NA
, align = "left")
, startLong = rollCount >= 4)
Then, add 5 days to each of those starting points (using unique
to avoid re-setting multiple times):
toFill <-
sapply(which(temp$startLong), function(x){
x + (0:5)
}) %>%
as.numeric() %>%
unique()
Then, add a column for the Type
, and set it to Long Term
just for those that we pulled in toFill
:
temp$Type <- NA
temp$Type[toFill] <- "Long Term"
Returns:
Location Date Sick rollCount startLong Type
<chr> <date> <dbl> <dbl> <lgl> <chr>
1 Madrid 2014-02-21 1 3 FALSE <NA>
2 Madrid 2014-02-22 1 3 FALSE <NA>
3 Madrid 2014-02-23 0 3 FALSE <NA>
4 Madrid 2014-02-24 0 4 TRUE Long Term
5 Madrid 2014-02-25 0 4 TRUE Long Term
6 Madrid 2014-02-26 1 5 TRUE Long Term
7 Madrid 2014-02-27 1 5 TRUE Long Term
8 Madrid 2014-02-28 1 5 TRUE Long Term
9 Madrid 2014-03-01 1 5 TRUE Long Term
10 Madrid 2014-03-02 0 4 TRUE Long Term
11 Madrid 2014-03-03 1 NA NA Long Term
12 Madrid 2014-03-04 1 NA NA Long Term
13 Madrid 2014-03-05 1 NA NA Long Term
14 Madrid 2014-03-06 1 NA NA Long Term
15 Madrid 2014-03-07 0 NA NA Long Term
If you want to use this for multiple different locations, it is probably best to wrap it in a function, then use group_by
and mutate
to add the column.
myFunction <- function(x){
temp <-
data.frame(Sick = x) %>%
mutate(rollCount = zoo::rollsum(Sick, 6
, fill = NA
, align = "left")
, startLong = rollCount >= 4)
toFill <-
sapply(which(temp$startLong), function(x){
x + (0:5)
}) %>%
as.numeric() %>%
unique()
temp$Type <- NA
temp$Type[toFill] <- "Long Term"
return(temp$Type)
}
Then, apply it to the data. Note that after applying the function, I am using ifelse
to set the value of Type
to "Short Term" if there is sickness and an empty string ("") if there is not.
Current %>%
group_by(Location) %>%
mutate(Type = myFunction(Sick)
, Type = ifelse(Sick == 1
, ifelse(!is.na(Type)
, Type, "Short Term")
, ""))
Returns:
Location Date Sick Type
<chr> <date> <dbl> <chr>
1 Madrid 2014-02-21 1 Short Term
2 Madrid 2014-02-22 1 Short Term
3 Madrid 2014-02-23 0
4 Madrid 2014-02-24 0 Long Term
5 Madrid 2014-02-25 0 Long Term
6 Madrid 2014-02-26 1 Long Term
7 Madrid 2014-02-27 1 Long Term
8 Madrid 2014-02-28 1 Long Term
9 Madrid 2014-03-01 1 Long Term
10 Madrid 2014-03-02 0
11 Madrid 2014-03-03 1 Long Term
12 Madrid 2014-03-04 1 Long Term
13 Madrid 2014-03-05 1 Long Term
14 Madrid 2014-03-06 1 Long Term
15 Madrid 2014-03-07 0
and should respect the different locations (again, assumes all dates are entered and rows are sorted by date).
Upvotes: 1
Reputation: 19395
solution:
use rolling sums in Pandas
df['flag'] = df.sickness.rolling(window = 6).sum()
df['long_term'] = (df['flag'] > 4)
or in R
library(RcppRoll)
library(dplyr)
mutate(df, flag = RcppRoll::roll_sum(x = sickness, 2),
long_term = if_else(flag > 4, 'Bim!', 'Boom!'))
Upvotes: 0