Reputation: 13

Grouping data in R based on specific column values

I have a set of data in a csv file that I need to group based on transitions of one column. I'm new to R and I'm having trouble finding the right way to accomplish this.

Simplified version of data:

Time    Phase    Pressure    Speed
 1        0        0.015      0
 2       25        0.015      0
 3       25        0.234      0
 4       25        0.111      0
 5        0        0.567      0
 6        0        0.876      0
 7       75        0.234      0
 8       75        0.542      0
 9       75        0.543      0

The length of time that phase changes state is longer than above but I shortened everything to make it readable and this pattern continues on and on. What I'm trying to do is calculate the mean of pressure and speed for each instance where the phase is non-zero. For example, in the output from the sample above there would be two lines, one with the average of the three lines where phase is 25, and with the average of the three lines when phase is 75. It will be possible to see cases where the same numeric value of phase shows up more than once, and I need to treat each of those separately. That is, in the case where phase is 0, 0, 25, 25, 25, 0, 0, 0, 25, 25, 0, I would need to record the first group and the second group of 25s as separate events, as well as any other non-zero groups.

What i've tried:

`csv <- read.csv("c:\\test.csv")`
`ins <- subset(csv,csv$Phase == 25)`
`exs <- subset(csv,csv$Phase == 75)`
`mean(ins$Pressure)`
`mean(exs$Pressure)`

This obviously returns the average of the entire file when phase is 25 and 75, but I need to somehow split it into groups using the trailing and leading 0s. Any help is appreciated.

Upvotes: 1

Answers (3)

TARehman

Reputation: 6749

EDITED: Based on feedback from the asker, they are really seeking to do some aggregations across runs of numbers (i.e. the first group of continuous 25s, then the second group of continuous 25s, and so on). Because of that, I suggest using rle or the run-level encoding function, to get a group number that you can use in the aggregate command.

I've modified the original data so that it contains two runs of 25, just for illustrative purposes, but it should work regardless. Using rle we get the encoded runs of data, and then we create a group number for each row. We do this by getting a vector of the total number of observed lengths, and then using the rep function to repeat each one by the appropriate length.

After this is done, we can use the same basic aggregation command again.

df_example <- data.frame(Time = 1:9,
                         Phase = c(0,25,25,25,0,0,25,25,0),
                         Pressure = c(0.015,0.015,0.234,0.111,0.567,0.876,0.234,0.542,0.543),
                         Speed = rep(x = 0,times = 9))

encoded_runs <- rle(x = df_example$Phase)
df_example$Group_No <- rep(x = 1:length(x = encoded_runs$lengths),
                           times = encoded_runs$lengths)

aggregate(x = df_example[df_example$Phase != 0,c("Pressure","Speed")],
          by = list(Group_No = df_example[df_example$Phase != 0,"Group_No"],
                    Phase = df_example[df_example$Phase != 0,"Phase"]),
          FUN = mean)

  Group_No Phase Pressure Speed
1        2    25    0.120     0
2        4    25    0.388     0

Upvotes: 0

IBrum

Reputation: 345

Building upon comment by Solos, and answer by Cheesman, try:

csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))

df_example = csv

aggregate(x = df_example[df_example$Phase != 0,c("Pressure","Speed")],
        by = list(Phase = df_example[df_example$Phase != 0,"block"]),
        FUN = mean)

actually plyr would be handy:

csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))

require(plyr)

ddply(csv[csv$Phase!=0,], .(block), summarize,
mean.Pressure=mean(Pressure), mean.Speed=mean(Speed))

Upvotes: 0

Andrew Cheesman

Reputation: 150

Super quick:

df <- read.csv("your_file_name.csv")

cbind(aggregate(Pressure ~ Phase, df[df$Phase != 0,], FUN = mean),
      aggregate(Speed ~ Phase, df[df$Phase != 0,], FUN = mean)[2])

The cbind is fancy - depending on the distribution of values of Phase, you'll need to merge instead.

Upvotes: 1

Grouping data in R based on specific column values

Answers (3)

Related Questions