t_goul

Reputation: 23

Imputing values equal to 1 with the median of the row (R)

I'm an inexperienced R user and I am trying to pre-process some biological data before statistical analysis for differential expression, using linear modelling.

I want to impute values that are equal to 1, row by row in a data frame, replacing each such value with the median of its row.

Here is some example data:

treatment1 <- c(125302640, 857538880, 43258573000, 1, 1, 225966496, 204262864)
treatment2 <- c(193170560, 797860990, 35646611000, 1, 221060400, 1, 1027615810)
treatment3 <- c(208872576, 914684860, 31535493100, 1, 1, 659360130, 3709508860)
count <- c(0, 0, 0, 3, 2, 1, 0)
df <- data.frame(treatment1, treatment2, treatment3, count)

I added a column called 'count' to the data frame, holding the number of 1's in each row, because I only want to impute rows that contain exactly one 1.

I first used a single row as a test:

test.row <- df[6,1:4]
test.row
  treatment1 treatment2 treatment3 count
6  225966496          1  659360130     1

I figured I would write a function that operated on a single row, and then use plyr::adply with .margins = 1, to apply the function to the whole df.

This is what I came up with:

if (test.row$count == 1) {
  median(as.numeric(test.row[1:3]))
} else {
  test.row[1:3]
}
# Output = 225966496, which is what I want.

But I am stuck with how to integrate it into a function. Here is my latest attempt:

impute.1 <- function(df, x){
  if(df$count == 1) {
    df[x == 1] <- median(as.numeric(df[x]))
    result <- df[x]
  } else {
    result <- df[x]
  }
  print(result)
}

impute.1(test.row, 1:3)

# Output = 
#   treatment1 treatment2 treatment3
# 6  225966496          1  659360130

# Desired Output = 
#   treatment1 treatment2 treatment3
# 6  225966496  225966496  659360130

So the function did not replace the 1 in this row with the row median, even though the row's count is 1.

Any advice or comments are greatly appreciated! Regards, Thomas.

Upvotes: 2

Views: 70

Answers (2)

Onyambu

Reputation: 79208

Another way is to use tidyverse:

library(tidyverse)
df %>% 
  rownames_to_column('rn') %>%
  pivot_longer(-c(count, rn)) %>%
  group_by(rn) %>%
  mutate(value = replace(value, value == 1 & count == 1, median(value))) %>%
  pivot_wider() %>%
  ungroup() %>%
  select(-count, everything(), count, -rn)


# A tibble: 7 x 4
   treatment1  treatment2  treatment3 count
        <dbl>       <dbl>       <dbl> <dbl>
1   125302640   193170560   208872576     0
2   857538880   797860990   914684860     0
3 43258573000 35646611000 31535493100     0
4           1           1           1     3
5           1   221060400           1     2
6   225966496   225966496   659360130     1
7   204262864  1027615810  3709508860     0
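To see what the replace() call does within each group, here is a small base R sketch on the values of row 6 alone. The vectors value and count are illustrative stand-ins for the long-format columns, not part of the pipeline above. Note that the median is taken over all three values, including the 1 itself.

```r
# Values of row 6 after pivoting to long format (illustrative stand-ins)
value <- c(225966496, 1, 659360130)
count <- 1

# replace() swaps the elements where the condition is TRUE for the median
imputed <- replace(value, value == 1 & count == 1, median(value))
imputed
```

For rows where count is not 1 the condition is FALSE everywhere, so replace() returns the values unchanged.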

Upvotes: 3

Ronak Shah

Reputation: 388907

You can use this Map approach:

cols <- 1:3

impute.1 <- function(x, count){
  if(count == 1) {
    x[x == 1] <- median(as.numeric(x))
    x
  } else x
}

df[cols] <- do.call(rbind, Map(impute.1, asplit(df[cols], 1), df$count))
df

#   treatment1  treatment2  treatment3 count
#1   125302640   193170560   208872576     0
#2   857538880   797860990   914684860     0
#3 43258573000 35646611000 31535493100     0
#4           1           1           1     3
#5           1   221060400           1     2
#6   225966496   225966496   659360130     1
#7   204262864  1027615810  3709508860     0
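Compared with the attempt in the question, the key change is that the comparison is made on the values rather than on the column indices. In the question's function x is 1:3, so x == 1 is TRUE only for the first position, and the assignment targets the first column rather than the cell that holds 1. Since treatment1 already equals the row median in row 6, the output looks unchanged. A small sketch of the difference, rebuilding row 6 by hand (the names idx_test, val_test and vals are only for illustration):

```r
# Row 6 of the example data, rebuilt so the snippet runs on its own
test.row <- data.frame(treatment1 = 225966496, treatment2 = 1,
                       treatment3 = 659360130, count = 1)
x <- 1:3

# Comparing the indices: only position 1 matches
idx_test <- x == 1

# Comparing the values: only the cell holding 1 matches
vals <- unlist(test.row[x], use.names = FALSE)
val_test <- vals == 1

# Imputing on the values gives the desired row
vals[val_test] <- median(vals)
vals
```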

Upvotes: 2
