Reputation: 432
I have a df like this (a simplified version):
lab_id weeks GM.CSF IFNa2 IFNg IL10 IL12p40
1 op1 2020 G4 week_1 1.6900 13.0258 5.0755 3.3068 1.3
2 op1 2020 G4 week_4 1.6900 4.4113 3.9592 2.0100 1.3
3 op10 2020 G4 week_4 2.4236 8.3186 41.7559 2.0100 1.3
4 op10 2020 G4 week_4 3.2600 18.3118 12.5456 2.0100 1.3
I'm trying to evaluate rows that are consecutive and meet specific criteria in two columns.
The column lab_id
has repeat strings, however, if you look at the weeks
column, you can have different values, in my example just appears week_1 and 4
, but I have 2 and 3
as well.
So, the condition is:
lab_id
has the same value in two consecutive rows;weeks
column, if week is different, do nothing;weeks
column from two consecutive rows is the same perform the following operation;For example, in rows 1 and 2, I have:
op1 2020 G4 week_1
op1 2020 G4 week_4
The code does need to do nothing, however, for rows 3 and 4, I have:
op10 2020 G4 week_4
op10 2020 G4 week_4
So, in this case, I need to calculate the mean of GM.CSF IFNa2 IFNg IL10 IL12p40
between the two consecutive rows that meet the mentioned criteria.
For the rows 3 and 4, my expected result must be:
lab_id weeks GM.CSF IFNa2 ...
3 op10 2020 G4 week_4 2.8418 13.3152 ...
As you can see, the new value is the mean from the numerical column from two consecutive rows, and now I only have one row with unique lab_id
and weeks
values without duplicating those two columns at the same time.
My idea for the code is to use rowise()
function and an if_else
statement.
test %>%
rowwise() %>%
mutate(result = eval(parse(string = condition)))
However, I think that this is not the correct way, because I'm evaluating two consecutive rows with specific string values in two columns, so maybe mutate()
is a better alternative.
mutate()
function to evaluate the condition before operating:
test %>%
mutate(duplicated_w = duplicated(lab_id) & weeks == "week_4")
So, if condition is TRUE
, perform mean()
from values in the column,
however, this condition is not been evaluated correctly, because is returning me TRUE
in rows where the condition does not meet the criteria.
These two ideas have different versions in my code, however non of them have been successful so far.
Questions:
Upvotes: 1
Views: 66
Reputation: 7390
Quite similar to @jay.sf's answer, but different enough (I suspect)1:
(by(X, interaction(X$lab_id, X$weeks, drop = TRUE),
\(x) cbind(x[1L, c("lab_id", "weeks")], t(colMeans(Filter(is.numeric, x))))) |>
array2DF())[-1L]
lab_id weeks GM.CSF IFNa2 IFNg IL10 IL12p40
1 op1 2020 G4 week_1 1.6900 13.0258 5.07550 3.3068 1.3
2 op1 2020 G4 week_4 1.6900 4.4113 3.95920 2.0100 1.3
3 op10 2020 G4 week_4 2.8418 13.3152 27.15075 2.0100 1.3
In case you are dealing with a massive amount of data, we might translate to {collapse}
syntax.
Using {collapse}
library(collapse)
X |> fgroup_by(lab_id, weeks) |> fmean()
1 Please let me know if not.
Upvotes: 1
Reputation: 79208
Base R functions:
aggregate(.~lab_id + weeks, dat, mean)
lab_id weeks GM.CSF IFNa2 IFNg IL10 IL12p40
1 op1 2020_G4_week_1 1.6900 13.0258 5.07550 3.3068 1.3
2 op1 2020_G4_week_4 1.6900 4.4113 3.95920 2.0100 1.3
3 op10 2020_G4_week_4 2.8418 13.3152 27.15075 2.0100 1.3
na.omit(array2DF(tapply(dat[-(1:2)], dat[1:2], colMeans)))
lab_id weeks GM.CSF IFNa2 IFNg IL10 IL12p40
1 op1 2020_G4_week_1 1.6900 13.0258 5.07550 3.3068 1.3
3 op1 2020_G4_week_4 1.6900 4.4113 3.95920 2.0100 1.3
4 op10 2020_G4_week_4 2.8418 13.3152 27.15075 2.0100 1.3
Upvotes: 2
Reputation: 72633
Using by
to split the data based on the interaction(lab_id, weeks)
, calculating colMeans
, rbind
.
> by(dat, with(dat, interaction(lab_id, weeks)), \(.) {
+ if (nrow(.) > 1) {
+ data.frame(.[1, 1:2], t(colMeans(.[-(1:2)])))
+ } else .
+ }) |> c(make.row.names=FALSE) |> do.call(what='rbind')
lab_id weeks GM.CSF IFNa2 IFNg IL10 IL12p40
1 op1 2020_G4_week_1 1.6900 13.0258 5.07550 3.3068 1.3
2 op1 2020_G4_week_4 1.6900 4.4113 3.95920 2.0100 1.3
3 op10 2020_G4_week_4 2.8418 13.3152 27.15075 2.0100 1.3
Data:
> dput(dat)
structure(list(lab_id = c("op1", "op1", "op10", "op10"), weeks = c("2020_G4_week_1",
"2020_G4_week_4", "2020_G4_week_4", "2020_G4_week_4"), GM.CSF = c(1.69,
1.69, 2.4236, 3.26), IFNa2 = c(13.0258, 4.4113, 8.3186, 18.3118
), IFNg = c(5.0755, 3.9592, 41.7559, 12.5456), IL10 = c(3.3068,
2.01, 2.01, 2.01), IL12p40 = c(1.3, 1.3, 1.3, 1.3)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Upvotes: 1