Sebastian Zeki

Reputation: 6874

Function to subtract each column from one specific column in R

I want to subtract each column from a column called df$Means in R. I want to do this as a function, but I'm not sure how to iterate through each of the columns: each iteration relies on one column being subtracted from df$Means, and then there is a load of downstream code that uses the output. I have simplified the code here, as this is the bit that's giving me trouble. So far I have:

CopyNumberLoop <- function (i) {df$ZScore <- (df[3:5]-df$Means)/(df$sd)
  } 
apply(df[3:50], 2, CopyNumberLoop)

but I'm not sure how to make sure that the operation is done on one column at a time. I don't think df[3:5] is correct?

I have been asked to produce a reproducible example so all the code I want is here:

df <- read.delim(file.choose(),header=TRUE)

    #Take the control samples and average each row for three columns excluding the first two columns- add the per row means to the data frame
    df$Means <- rowMeans(df[,30:32]) 
    RowVar <- function(x) {rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)}
    df$sd=sqrt(RowVar(df[,c(30:32)]))

    #Get a Z score by subtracting the control-sample average from the test sample count at each locus, then dividing by the st dev for controls at each locus.

{ df$ZScore <- (df[,35]-df$Means)/(df$sd)

    ######################################### QUARTILE FILTER ###########################################################
    alpha=1.5
    numberofControls = 3
    UL = median(df$ZScore, na.rm = TRUE) + alpha*IQR(df$ZScore, na.rm = TRUE)
    LL = median(df$ZScore, na.rm = TRUE) - alpha*IQR(df$ZScore, na.rm = TRUE)

    #Copy the Z score if the score is > or < a certain number, i.e. LL or UL.
    Zoutliers <- which(df$ZScore > UL | df$ZScore < LL)
    df$Zoutliers <- ifelse(df$ZScore > UL |df$ZScore <LL ,1,-1)
    tempout = ifelse(df$ZScore[Zoutliers] > UL,1,-1)

    ######################################### Three neighbour Isolation filter ##############################################################################
    finalSeb=c()
    for(i in 2:(length(Zoutliers)-1)){
     j=Zoutliers[i]
     if(sum(ifelse((j-1) == Zoutliers,1,0)) > 0 & tempout[i] ==  tempout[i-1] & sum(ifelse((j+1) == Zoutliers,1,0)) > 0 & tempout[i] ==  tempout[i+1]){
       finalSeb = c(finalSeb,i)
     }  
    }
    finalset_row_number = Zoutliers[finalSeb]
    #View(finalset_row_number)
    p_seq = rep(0,nrow(df))
    for(i in 1:length(finalset_row_number)){
     p_seq[(finalset_row_number[i]-1):(finalset_row_number[i]+1)] = median(df$ZScore[(finalset_row_number[i]-1):(finalset_row_number[i]+1)])
    }

    nrow(as.data.frame(finalset_row_number))
    }

For each column between 3 and 50 I'd like to generate nrow(as.data.frame(finalset_row_number)) and keep it in another data frame. Admittedly my code is a mess because I don't know how to create the function that will allow me to apply this to each column.

Upvotes: 0

Views: 4054

Answers (2)

IRTFM

Reputation: 263331

It appears that you want the Z-scores assigned back into the original dataframe as named columns. If you want to loop over columns, it is just as economical to use lapply or sapply. The receiving function accepts each column in turn and matches it to its first parameter. Any other arguments supplied after the receiving function are matched by name or position to the remaining parameters. Do not do any assignment to 'df' inside the function:

CopyNumberLoop <- function (col) { (col - df$Means) / df$sd  # subtract first, then divide
                         } 
df[, paste0('ZScore' , 3:50)] <-  # assignment done outside the loop
         lapply(df[3:50], CopyNumberLoop)  # result is a list
                # but the `[<-.data.frame` method will accept a list.

Using apply coerces to a matrix, which may have undesirable effects if a column is not numeric (say factor or date-time). It's better to get into the habit of using lapply when working on ranges of columns in dataframes.
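As a small illustration of that coercion, here is a sketch on a made-up data frame (the name `toy` and its columns are hypothetical, not from the question). It compares the class each function sees for every column:

```r
# Hypothetical toy data frame: one numeric column and one factor column.
toy <- data.frame(x = c(1.5, 2.5, 3.5), g = factor(c("a", "b", "a")))

# apply() first converts the data frame to a matrix; with a non-numeric
# column present that matrix is character, so every column arrives as text.
apply_classes <- apply(toy, 2, class)

# sapply()/lapply() treat the data frame as a list and pass each column
# through unchanged, so the original classes are preserved.
lapply_classes <- sapply(toy, class)

print(apply_classes)
print(lapply_classes)
```

With `apply`, the numeric column can no longer be used for arithmetic inside the function, which is the kind of side effect described above.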

If you want to assign the result of this operation to a new dataframe, then the lapply(.) result would need to be wrapped in as.data.frame, and column names could then be assigned. The same would be needed for a result from apply(.).
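A minimal sketch of that variant, assuming a tiny stand-in data frame (the column names `s1`, `s2`, the values, and the helper name `z_score` are all made up for illustration):

```r
# Hypothetical stand-in for the asker's data: two sample columns plus the
# per-row control mean and standard deviation.
df <- data.frame(s1 = c(10, 20, 30), s2 = c(12, 18, 33),
                 Means = c(11, 19, 31), sd = c(1, 2, 3))

# Per-column Z score; reads df$Means and df$sd but never assigns to df.
z_score <- function(col) (col - df$Means) / df$sd

# lapply() returns a list; as.data.frame() turns it into a new data frame.
z_df <- as.data.frame(lapply(df[c("s1", "s2")], z_score))

# Rename the columns after the conversion.
names(z_df) <- paste0("ZScore_", c("s1", "s2"))
print(z_df)
```

The original `df` is left untouched here, which can be convenient when the Z scores feed into separate downstream steps.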

Upvotes: 0

Konrad Rudolph

Reputation: 545528

Your code isn’t using the parameter i at all. In fact, i is the current column, so that’s what you should use:

result = apply(df[, 3 : 50], 2, function (col) col - df$Means)

Or you can subtract the means directly:

result = df[, 3 : 50] - df$Means

Both give columns 3–50 of df with df$Means subtracted from each in turn: the apply version returns a matrix, while the direct subtraction returns a data frame. Or, if you want to calculate Z scores as your code seems to do:

result = (df[, 3 : 50] - df$Means) / df$sd
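To connect this to the question's goal of one number per column, here is a rough sketch under some loud assumptions: the data, the column names, and the helper `count_outliers` are hypothetical, and only the median ± alpha·IQR filter from the question is reproduced (the three-neighbour filter is left out for brevity).

```r
# Hypothetical toy data; values chosen only for illustration.
df <- data.frame(
  s1 = c(10, 11, 9, 10, 30, 10, 11),
  s2 = c(9, 10, 11, 9, 10, 11, 10),
  Means = rep(10, 7),
  sd = rep(1, 7)
)

# Z scores for all sample columns at once (the result is a data frame).
z <- (df[c("s1", "s2")] - df$Means) / df$sd

# Count the values of one column that fall outside median +/- alpha * IQR.
count_outliers <- function(zcol, alpha = 1.5) {
  ul <- median(zcol, na.rm = TRUE) + alpha * IQR(zcol, na.rm = TRUE)
  ll <- median(zcol, na.rm = TRUE) - alpha * IQR(zcol, na.rm = TRUE)
  sum(zcol > ul | zcol < ll, na.rm = TRUE)
}

# sapply() calls the function once per column and returns a named vector.
outlier_counts <- sapply(z, count_outliers)

# Collect the per-column counts in a separate data frame.
counts_df <- data.frame(column = names(outlier_counts),
                        n_outliers = unname(outlier_counts))
print(counts_df)
```

The remaining filtering steps from the question could go inside the per-column function in the same way, as long as it returns a single value per column.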

Upvotes: 1
