Reputation: 3195
I have this dataset:
df=structure(list(Dt = structure(1:39, .Label = c("2018-02-20 00:00:00.000",
"2018-02-21 00:00:00.000", "2018-02-22 00:00:00.000", "2018-02-23 00:00:00.000",
"2018-02-24 00:00:00.000", "2018-02-25 00:00:00.000", "2018-02-26 00:00:00.000",
"2018-02-27 00:00:00.000", "2018-02-28 00:00:00.000", "2018-03-01 00:00:00.000",
"2018-03-02 00:00:00.000", "2018-03-03 00:00:00.000", "2018-03-04 00:00:00.000",
"2018-03-05 00:00:00.000", "2018-03-06 00:00:00.000", "2018-03-07 00:00:00.000",
"2018-03-08 00:00:00.000", "2018-03-09 00:00:00.000", "2018-03-10 00:00:00.000",
"2018-03-11 00:00:00.000", "2018-03-12 00:00:00.000", "2018-03-13 00:00:00.000",
"2018-03-14 00:00:00.000", "2018-03-15 00:00:00.000", "2018-03-16 00:00:00.000",
"2018-03-17 00:00:00.000", "2018-03-18 00:00:00.000", "2018-03-19 00:00:00.000",
"2018-03-20 00:00:00.000", "2018-03-21 00:00:00.000", "2018-03-22 00:00:00.000",
"2018-03-23 00:00:00.000", "2018-03-24 00:00:00.000", "2018-03-25 00:00:00.000",
"2018-03-26 00:00:00.000", "2018-03-27 00:00:00.000", "2018-03-28 00:00:00.000",
"2018-03-29 00:00:00.000", "2018-03-30 00:00:00.000"), class = "factor"),
ItemRelation = c(158043L, 158043L, 158043L, 158043L, 158043L,
158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L,
158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L,
158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L,
158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L,
158043L, 158043L, 158043L, 158043L, 158043L, 158043L), stuff = c(200L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 3600L, 0L, 0L, 0L, 0L,
700L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1000L,
2600L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 700L), num = c(1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L), year = c(2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L), action = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)), .Names = c("Dt", "ItemRelation",
"stuff", "num", "year", "action"), class = "data.frame", row.names = c(NA,
-39L))
The action column has only two values, 0 and 1. I need to calculate the median of stuff for the one category of action, and then the median of stuff for the zero category of action, using only the last five integer (non-zero) values before the one category begins. In other words, for the zero category I must take only its last 5 non-zero observations, not calculate the median over all of its values. In our case these are:
200
3600
700
1000
2600
Then I subtract the median of the zero category from the median of the one category.
The number of non-zero (integer) observations of stuff in the zero category of action can vary from 0 to 10. If there are 10 non-zero values in the zero category, we take the last five. If there are only 1 to 5 non-zero values, we subtract the median of however many there actually are. If the zero category contains only zeros, we just subtract 0.
This solution by Akshay from the adjacent topic "How to subtract a median only from integer value" helped me:
library(dplyr)

df.0 <- df %>% filter(action == 0 & stuff != 0) %>% arrange(Dt) %>% top_n(5)
df.1 <- df %>% filter(action == 1 & stuff != 0)
new.df <- rbind(df.0, df.1)
View(
  df %>% select(everything()) %>% group_by(ItemRelation, num, year) %>%
    summarise(
      median.1 = median(stuff[action == 1 & stuff != 0], na.rm = TRUE),
      median.0 = median(stuff[action == 0 & stuff != 0], na.rm = TRUE)
    ) %>%
    mutate(
      value = median.1 - median.0,
      DocumentNum = num,
      DocumentYear = year
    ) %>%
    select(ItemRelation, DocumentNum, DocumentYear, value)
)
But this code calculates the median over all observations in the zero category of action. It should calculate the median for the zero category using only the last 5 non-zero observations before the one category.
If anybody helps me in the original (adjacent) topic, I'll just delete this new topic, so as not to produce related topics.
output <- data.frame(mydat[which.max(as.Date(mydat$Dt)),
                           c("CustomerName","ItemRelation","DocumentNum","DocumentYear")],
                     value = m,
                     row.names = 1:length(which.max(as.Date(mydat$Dt))))
CustomerName ItemRelation DocumentNum DocumentYear value
1 orange TC 157214 1529 2018 162
Why do I get the result for only one row? The output must look like the example below: there are many strata, not just one.
CustomerName ItemRelation DocumentNum DocumentYear value
1 orange TC 157214 1529 2018 162
2 appleTC 5 1529 2018 164
Upvotes: 0
Views: 54
Reputation: 5281
It is not quite clear to me what exactly you wish to accomplish. However, this might be of help.
You can subset the part of the data you need using which and intersect:
# df with action 0 and stuff > 0
v <- df$stuff[intersect(which(df$action == 0),
which(df$stuff > 0))]
# df with action 1 and stuff > 0
w <- df$stuff[intersect(which(df$action == 1),
which(df$stuff > 0))]
v contains all elements of stuff where action is 0 and stuff is not 0. From here on, calculating the median is a formality. (You might want to add safety measures in case intersect(...) is empty, e.g. if stuff is always 0 when action is 0.)
# calculating the median of v for the last 5 observations
# tail() also works when v has fewer than 5 elements, unlike v[(l-4):l]
m0 <- if (length(v) > 0) median(tail(v, 5)) else 0 # subtract 0 if v is empty
# computing the final difference
m <- median(w) - m0
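The same steps can be wrapped in a small helper so that the edge cases from the question (fewer than five non-zero values, or none at all) are handled in one place. This is only a sketch: the name median_diff and the argument n are made up for illustration, and it assumes that the rows are already in date order, that all action 0 rows come before the action 1 rows, and that "integer value" means a non-zero stuff.

```r
# Hypothetical helper (not from the original posts): median of the non-zero
# `stuff` where action == 1, minus the median of the last `n` non-zero
# `stuff` values where action == 0. Assumes the input is ordered by date.
median_diff <- function(stuff, action, n = 5) {
  v <- tail(stuff[action == 0 & stuff != 0], n)     # last n non-zero values, action 0
  w <- stuff[action == 1 & stuff != 0]              # all non-zero values, action 1
  m0 <- if (length(v) > 0) median(v) else 0         # nothing to subtract -> 0
  m1 <- if (length(w) > 0) median(w) else NA_real_  # no action 1 values -> NA
  m1 - m0
}

# Small made-up vectors that mirror the non-zero pattern of the question's data
stuff  <- c(200, 0, 3600, 700, 0, 1000, 2600, 0, 700)
action <- c(0, 0, 0, 0, 0, 0, 0, 1, 1)
res <- median_diff(stuff, action)
```

Calling median_diff(df$stuff, df$action) on the question's data frame should follow the same path, since tail() simply returns everything when fewer than n values are available.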
Edit
To reproduce the above output, consider
output <- data.frame(df[which.max(as.Date(df$Dt)),
c("Dt","ItemRelation","num","year")],
value = m,
row.names = 1:length(which.max(as.Date(df$Dt))))
where which.max(as.Date(df$Dt)) gives the row number of the latest date. However, the logic you are applying to get that result might differ, so caution is advised here.
Anyway, here is the output
> output
Dt ItemRelation num year value
1 2018-03-30 00:00:00.000 158043 1459 2018 -300
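The follow-up in the question asks for one row per stratum rather than a single row. One possible base R sketch is to split the data frame and repeat the calculation on each piece. It assumes the strata are defined by ItemRelation, num and year (as in the question's group_by call); the toy data frame and the helper name one_group are invented for illustration and are not part of the original posts.

```r
# Toy data with two groups (values invented for illustration only)
toy <- data.frame(
  Dt = as.Date("2018-03-01") + rep(0:4, times = 2),
  ItemRelation = rep(c(158043L, 157214L), each = 5),
  num = 1459L,
  year = 2018L,
  stuff = c(200, 0, 1000, 0, 700,  # group 158043
            0, 0, 0, 0, 300),      # group 157214: no non-zero action 0 values
  action = rep(c(0, 0, 0, 1, 1), times = 2)
)

# Hypothetical per-group helper: returns a one-row data frame for one stratum
one_group <- function(d) {
  d <- d[order(d$Dt), ]                                # ensure date order
  v <- tail(d$stuff[d$action == 0 & d$stuff != 0], 5)  # last 5 non-zero, action 0
  w <- d$stuff[d$action == 1 & d$stuff != 0]           # non-zero, action 1
  m0 <- if (length(v) > 0) median(v) else 0            # only zeros -> subtract 0
  data.frame(d[1, c("ItemRelation", "num", "year")], value = median(w) - m0)
}

# Split into strata, compute one row per stratum, and stack the rows
groups <- split(toy, list(toy$ItemRelation, toy$num, toy$year), drop = TRUE)
result <- do.call(rbind, lapply(groups, one_group))
rownames(result) <- NULL
result
```

Because the row is built inside each group, the result has as many rows as there are strata, instead of the single row that which.max returns for the whole data frame.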
Upvotes: 1