Reputation: 23
I'm an inexperienced R user, and I am trying to pre-process some biological data before a differential expression analysis using linear modelling.
I want to impute values equal to 1, row by row in a data frame, replacing them with the median of that row.
Here is some example data:
treatment1 <- c(125302640, 857538880, 43258573000, 1, 1, 225966496, 204262864)
treatment2 <- c(193170560, 797860990, 35646611000, 1, 221060400, 1, 1027615810)
treatment3 <- c(208872576, 914684860, 31535493100, 1, 1, 659360130, 3709508860)
count <- c(0, 0, 0, 3, 2, 1, 0)
df <- data.frame(treatment1, treatment2, treatment3, count)
I made a column in the data frame called 'count', which holds the number of 1's in each row, because I only want to impute values in rows that contain exactly one 1.
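As an aside, the count column does not have to be typed by hand. A minimal sketch, assuming the three treatment columns are the only ones that need checking (dat and treat_cols are hypothetical names, used so that the original df is left alone):

```r
# Same example values as above, without the hand-typed count column
dat <- data.frame(
  treatment1 = c(125302640, 857538880, 43258573000, 1, 1, 225966496, 204262864),
  treatment2 = c(193170560, 797860990, 35646611000, 1, 221060400, 1, 1027615810),
  treatment3 = c(208872576, 914684860, 31535493100, 1, 1, 659360130, 3709508860)
)

treat_cols <- c("treatment1", "treatment2", "treatment3")

# dat[treat_cols] == 1 gives a logical matrix; rowSums() counts the TRUEs per row
dat$count <- rowSums(dat[treat_cols] == 1)
```

For this example data the derived column matches the hand-typed one (0, 0, 0, 3, 2, 1, 0).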
I first used a single row as a test:
test.row <- df[6,1:4]
test.row
treatment1 treatment2 treatment3 count
6 225966496 1 659360130 1
I figured I would write a function that operated on a single row, and then use plyr::adply with .margins = 1, to apply the function to the whole df.
This is what I came up with:
if(test.row$count == 1) {
median(as.numeric(test.row[1:3]))
} else {
test.row[1:3]
}
# Output = 225966496, which is what I want.
But I am stuck with how to integrate it into a function. Here is my latest attempt:
impute.1 <- function(df, x){
if(df$count == 1) {
df[x == 1] <- median(as.numeric(df[x]))
result <- df[x]
} else {
result <- df[x]
}
print(result)
}
impute.1(test.row, 1:3)
# Output =
# treatment1 treatment2 treatment3
# 6 225966496 1 659360130
# Desired Output =
# treatment1 treatment2 treatment3
# 6 225966496 225966496 659360130
So it did not recognise that this row has a count of 1, and that the 1 should therefore be replaced with the median of the row.
Any advice or comments are greatly appreciated! Regards, Thomas.
Upvotes: 2
Views: 70
Reputation: 79208
Another way is to use tidyverse:
library(tidyverse)
df %>%
rownames_to_column('rn') %>%
pivot_longer(-c(count, rn)) %>%
group_by(rn) %>%
mutate(value = replace(value, value == 1 & count == 1, median(value))) %>%
pivot_wider() %>%
ungroup() %>%
select(-count, everything(), count, -rn)
# A tibble: 7 x 4
treatment1 treatment2 treatment3 count
<dbl> <dbl> <dbl> <dbl>
1 125302640 193170560 208872576 0
2 857538880 797860990 914684860 0
3 43258573000 35646611000 31535493100 0
4 1 1 1 3
5 1 221060400 1 2
6 225966496 225966496 659360130 1
7 204262864 1027615810 3709508860 0
Upvotes: 3
Reputation: 388907
You can use this Map approach:
cols <- 1:3
impute.1 <- function(x, count){
if(count == 1) {
x[x == 1] <- median(as.numeric(x))
x
} else x
}
df[cols] <- do.call(rbind, Map(impute.1, asplit(df[cols], 1), df$count))
df
# treatment1 treatment2 treatment3 count
#1 125302640 193170560 208872576 0
#2 857538880 797860990 914684860 0
#3 43258573000 35646611000 31535493100 0
#4 1 1 1 3
#5 1 221060400 1 2
#6 225966496 225966496 659360130 1
#7 204262864 1027615810 3709508860 0
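As for why the attempt in the question leaves the row unchanged: inside that function x is the index vector 1:3, so x == 1 tests the column positions rather than the cell values, and the 1 in treatment2 is never selected. A minimal sketch of a single-row version that compares the values instead (impute_row, dat and df_rowwise are hypothetical names, and the data is repeated only so the block runs on its own):

```r
dat <- data.frame(
  treatment1 = c(125302640, 857538880, 43258573000, 1, 1, 225966496, 204262864),
  treatment2 = c(193170560, 797860990, 35646611000, 1, 221060400, 1, 1027615810),
  treatment3 = c(208872576, 914684860, 31535493100, 1, 1, 659360130, 3709508860),
  count      = c(0, 0, 0, 3, 2, 1, 0)
)

# Hypothetical helper: takes a one-row data frame and the columns to check
impute_row <- function(row, cols) {
  if (row$count == 1) {
    vals <- as.numeric(row[cols])      # the cell values, not the column positions
    vals[vals == 1] <- median(vals)    # the median is computed before the replacement
    row[cols] <- as.list(vals)
  }
  row
}

# Apply the helper row by row and stack the results back into one data frame
df_rowwise <- do.call(
  rbind,
  lapply(seq_len(nrow(dat)), function(i) impute_row(dat[i, ], 1:3))
)
```

With this, impute_row(dat[6, ], 1:3) gives 225966496, 225966496, 659360130 for the treatment columns, which is the desired output from the question. All other rows are returned as they are.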
Upvotes: 2