Reputation: 17
I am a bit stuck on a specific problem in R, for which I only have a a long-winded and unelegant solution. Maybe somebody knows a better way of doing this. Lets say we have a data frame like this:
x <- c("A", "B", "C", "B", "A", "C", "C", "B", "A", "B", "A", "C")
z <- c(1, 1, 1, 2, 2, 2,3, 3, 3, 4, 4,4)
y <- c(43, 32, 45, 32, 22, 52, 23, 13, 12, 4, 12, 5)
df <- data.frame(x,z,y)
with the data looking like this: plot
I am trying to calculate the difference between the y values within the x groupings, depending on z. E.g. the difference between A of group 1 and A of group 2 (43-22=21) and A of group 2 and A of group 3 (22-12=10) and so on. I can do this very ugly like this:
ordered.df<-df[order(df$z, df$x),]
bl<-ordered.df[ordered.df$z==1,]
bl2<-ordered.df[ordered.df$z==2,]
bl3<-ordered.df[ordered.df$z==3,]
bl4<-ordered.df[ordered.df$z==4,]
first <- bl$y - bl2$y
second <- bl2$y - bl3$y
third <- bl3$y - bl4$y
ycolumn <- c(first,second,third)
xcolumn <- rep(c("A","B","C"),3)
zcolumn <- rep(1:3,each=3)
final.df <- data.frame(xcolumn,zcolumn,ycolumn)
and was wondering if there is any more elegenat and scalable solutions.
Ultimately, I want to count the occurances of positive differences:
final.df$lower <- 0
final.df$lower[final.df$ycolumn>0] <- 1
aggregate(lower ~ zcolumn, final.df, sum)
Any suggestions are welcome! Thanks!
Upvotes: 0
Views: 325
Reputation: 3183
You can use dplyr
and mutate
for each group:
library(dplyr)
final.df <- df %>%
group_by(x) %>%
mutate(y = y - lead(y, 1)) %>%
arrange(z, x) %>%
filter(!is.na(y))
final.df
# A tibble: 9 x 3
# Groups: x [3]
x z y
<fct> <dbl> <dbl>
1 A 1 21
2 B 1 0
3 C 1 -7
4 A 2 10
5 B 2 19
6 C 2 29
7 A 3 0
8 B 3 9
9 C 3 18
And use summarise
for the aggregate:
final.df %>%
group_by(z) %>%
summarise(lower = sum(y > 0))
# A tibble: 3 x 2
z lower
<dbl> <dbl>
1 1 1
2 2 3
3 3 2
Upvotes: 1
Reputation: 27762
Here is a data.table one-liner
setDT(df)[, list(z = z, y_diff = y - shift(y, 1, type = "lead")), by = .(x = x)][ y_diff > 0, list(lower = .N), by = "z"]
# z lower
#1: 1 1
#2: 2 3
#3: 3 2
what it does:
setDT(df)
make a data.table out of df
[, list(z = z, y_diff = y - shift(y, 1, type = "lead")), by = .(x = x)]
group by x, subtract the next y from the current y and create column y_diff
with the outcome
[ y_diff > 0, list(lower = .N), by = "z"]
for all rows where y_diff
is greater than 0, give the number of rows (=.N
), grouped by z
Upvotes: 2
Reputation: 31
If you actual data only has a small number of groups (z), but large number of categories (x), you can convert the data to wide format and start from there. Here is a solution for the simple example.
> library(data.table)
> library(reshape2)
>
> df_reshape = dcast(x~z, data=df, value.var="y")
> setDT(df_reshape)
> setnames(df_reshape,c(as.character(seq(1,4,1))),c(paste("x",as.character(seq(1,4,1)), sep="")))
> head(df_reshape)
x x1 x2 x3 x4
1: A 43 22 12 12
2: B 32 32 13 4
3: C 45 52 23 5
> df_reshape[,.(diff1=x1-x2,diff2=x2-x3,diff3=x3-x4),by=x][,.(group1 = sum(diff1>0),group2 =sum(diff2>0),group3 = sum(diff3>0))]
group1 group2 group3
1: 1 3 2
Upvotes: 1