Reputation: 432

Calculate mean between consecutive rows that meet specifc condition in two columns

I have a df like this (a simplified version):

lab_id  weeks GM.CSF IFNa2 IFNg IL10 IL12p40
1  op1 2020 G4 week_1 1.6900 13.0258  5.0755 3.3068 1.3
2  op1 2020 G4 week_4 1.6900  4.4113  3.9592 2.0100 1.3
3 op10 2020 G4 week_4 2.4236  8.3186 41.7559 2.0100 1.3
4 op10 2020 G4 week_4 3.2600 18.3118 12.5456 2.0100 1.3

I'm trying to evaluate rows that are consecutive and meet specific criteria in two columns.

The column lab_id has repeat strings, however, if you look at the weeks column, you can have different values, in my example just appears week_1 and 4, but I have 2 and 3 as well.

So, the condition is:

if column lab_id has the same value in two consecutive rows;
evaluate the string in the weeks column, if week is different, do nothing;
if the string in weeks column from two consecutive rows is the same perform the following operation;
calculate the mean of the numerical values from the two consecutive rows;
then replace the row for the new row with the mean calculated values;

For example, in rows 1 and 2, I have:

op1 2020 G4 week_1
op1 2020 G4 week_4

The code does need to do nothing, however, for rows 3 and 4, I have:

op10 2020 G4 week_4
op10 2020 G4 week_4

So, in this case, I need to calculate the mean of GM.CSF IFNa2 IFNg IL10 IL12p40 between the two consecutive rows that meet the mentioned criteria. For the rows 3 and 4, my expected result must be:

lab_id  weeks GM.CSF IFNa2 ...
3 op10 2020 G4 week_4 2.8418 13.3152 ...

As you can see, the new value is the mean from the numerical column from two consecutive rows, and now I only have one row with unique lab_id and weeks values without duplicating those two columns at the same time.

My idea for the code is to use rowise() function and an if_else statement.

test %>%  
  rowwise() %>%
  mutate(result = eval(parse(string = condition)))

However, I think that this is not the correct way, because I'm evaluating two consecutive rows with specific string values in two columns, so maybe mutate() is a better alternative.

mutate() function to evaluate the condition before operating:

test %>%
    mutate(duplicated_w = duplicated(lab_id) & weeks == "week_4")

So, if condition is TRUE, perform mean() from values in the column, however, this condition is not been evaluated correctly, because is returning me TRUE in rows where the condition does not meet the criteria.

These two ideas have different versions in my code, however non of them have been successful so far.

Questions:

Is there any function to evaluate columns and consecutive rows at the same time?
What can I do with my current alternatives?

Upvotes: 1

Answers (3)

Friede

Reputation: 7390

Quite similar to @jay.sf's answer, but different enough (I suspect)¹:

(by(X, interaction(X$lab_id, X$weeks, drop = TRUE), 
    \(x) cbind(x[1L, c("lab_id", "weeks")], t(colMeans(Filter(is.numeric, x))))) |> 
    array2DF())[-1L]
        lab_id  weeks GM.CSF   IFNa2     IFNg   IL10 IL12p40
1  op1 2020 G4 week_1 1.6900 13.0258  5.07550 3.3068     1.3
2  op1 2020 G4 week_4 1.6900  4.4113  3.95920 2.0100     1.3
3 op10 2020 G4 week_4 2.8418 13.3152 27.15075 2.0100     1.3

In case you are dealing with a massive amount of data, we might translate to {collapse} syntax.

Using {collapse}

library(collapse)
X |> fgroup_by(lab_id, weeks) |> fmean()

¹ Please let me know if not.

Upvotes: 1

Onyambu

Reputation: 79208

Base R functions:

aggregate(.~lab_id + weeks, dat, mean)

  lab_id          weeks GM.CSF   IFNa2     IFNg   IL10 IL12p40
1    op1 2020_G4_week_1 1.6900 13.0258  5.07550 3.3068     1.3
2    op1 2020_G4_week_4 1.6900  4.4113  3.95920 2.0100     1.3
3   op10 2020_G4_week_4 2.8418 13.3152 27.15075 2.0100     1.3

na.omit(array2DF(tapply(dat[-(1:2)], dat[1:2], colMeans)))

  lab_id          weeks GM.CSF   IFNa2     IFNg   IL10 IL12p40
1    op1 2020_G4_week_1 1.6900 13.0258  5.07550 3.3068     1.3
3    op1 2020_G4_week_4 1.6900  4.4113  3.95920 2.0100     1.3
4   op10 2020_G4_week_4 2.8418 13.3152 27.15075 2.0100     1.3

Upvotes: 2

jay.sf

Reputation: 72633

Using by to split the data based on the interaction(lab_id, weeks), calculating colMeans, rbind.

> by(dat, with(dat, interaction(lab_id, weeks)), \(.) {
+   if (nrow(.) > 1) {
+     data.frame(.[1, 1:2], t(colMeans(.[-(1:2)])))
+   } else .
+ }) |> c(make.row.names=FALSE) |> do.call(what='rbind')
  lab_id          weeks GM.CSF   IFNa2     IFNg   IL10 IL12p40
1    op1 2020_G4_week_1 1.6900 13.0258  5.07550 3.3068     1.3
2    op1 2020_G4_week_4 1.6900  4.4113  3.95920 2.0100     1.3
3   op10 2020_G4_week_4 2.8418 13.3152 27.15075 2.0100     1.3

Data:

> dput(dat)
structure(list(lab_id = c("op1", "op1", "op10", "op10"), weeks = c("2020_G4_week_1", 
"2020_G4_week_4", "2020_G4_week_4", "2020_G4_week_4"), GM.CSF = c(1.69, 
1.69, 2.4236, 3.26), IFNa2 = c(13.0258, 4.4113, 8.3186, 18.3118
), IFNg = c(5.0755, 3.9592, 41.7559, 12.5456), IL10 = c(3.3068, 
2.01, 2.01, 2.01), IL12p40 = c(1.3, 1.3, 1.3, 1.3)), class = "data.frame", row.names = c("1", 
"2", "3", "4"))

Upvotes: 1

Calculate mean between consecutive rows that meet specifc condition in two columns

Answers (3)

Related Questions