Reputation: 13
I have a set of data in a csv file that I need to group based on transitions of one column. I'm new to R and I'm having trouble finding the right way to accomplish this.
Simplified version of data:
Time Phase Pressure Speed
1 0 0.015 0
2 25 0.015 0
3 25 0.234 0
4 25 0.111 0
5 0 0.567 0
6 0 0.876 0
7 75 0.234 0
8 75 0.542 0
9 75 0.543 0
The length of time that phase changes state is longer than above but I shortened everything to make it readable and this pattern continues on and on. What I'm trying to do is calculate the mean of pressure and speed for each instance where the phase is non-zero. For example, in the output from the sample above there would be two lines, one with the average of the three lines where phase is 25, and with the average of the three lines when phase is 75. It will be possible to see cases where the same numeric value of phase shows up more than once, and I need to treat each of those separately. That is, in the case where phase is 0, 0, 25, 25, 25, 0, 0, 0, 25, 25, 0
, I would need to record the first group and the second group of 25s as separate events, as well as any other non-zero groups.
What i've tried:
`csv <- read.csv("c:\\test.csv")`
`ins <- subset(csv,csv$Phase == 25)`
`exs <- subset(csv,csv$Phase == 75)`
`mean(ins$Pressure)`
`mean(exs$Pressure)`
This obviously returns the average of the entire file when phase is 25 and 75, but I need to somehow split it into groups using the trailing and leading 0s. Any help is appreciated.
Upvotes: 1
Views: 2643
Reputation: 6749
EDITED: Based on feedback from the asker, they are really seeking to do some aggregations across runs of numbers (i.e. the first group of continuous 25s, then the second group of continuous 25s, and so on). Because of that, I suggest using rle
or the run-level encoding function, to get a group number that you can use in the aggregate
command.
I've modified the original data so that it contains two runs of 25, just for illustrative purposes, but it should work regardless. Using rle
we get the encoded runs of data, and then we create a group number for each row. We do this by getting a vector of the total number of observed lengths, and then using the rep
function to repeat each one by the appropriate length.
After this is done, we can use the same basic aggregation command again.
df_example <- data.frame(Time = 1:9,
Phase = c(0,25,25,25,0,0,25,25,0),
Pressure = c(0.015,0.015,0.234,0.111,0.567,0.876,0.234,0.542,0.543),
Speed = rep(x = 0,times = 9))
encoded_runs <- rle(x = df_example$Phase)
df_example$Group_No <- rep(x = 1:length(x = encoded_runs$lengths),
times = encoded_runs$lengths)
aggregate(x = df_example[df_example$Phase != 0,c("Pressure","Speed")],
by = list(Group_No = df_example[df_example$Phase != 0,"Group_No"],
Phase = df_example[df_example$Phase != 0,"Phase"]),
FUN = mean)
Group_No Phase Pressure Speed
1 2 25 0.120 0
2 4 25 0.388 0
Upvotes: 0
Reputation: 345
Building upon comment by Solos, and answer by Cheesman, try:
csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))
df_example = csv
aggregate(x = df_example[df_example$Phase != 0,c("Pressure","Speed")],
by = list(Phase = df_example[df_example$Phase != 0,"block"]),
FUN = mean)
actually plyr
would be handy:
csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))
require(plyr)
ddply(csv[csv$Phase!=0,], .(block), summarize,
mean.Pressure=mean(Pressure), mean.Speed=mean(Speed))
Upvotes: 0
Reputation: 150
Super quick:
df <- read.csv("your_file_name.csv")
cbind(aggregate(Pressure ~ Phase, df[df$Phase != 0,], FUN = mean),
aggregate(Speed ~ Phase, df[df$Phase != 0,], FUN = mean)[2])
The cbind
is fancy - depending on the distribution of values of Phase, you'll need to merge
instead.
Upvotes: 1