rg6

Reputation: 329

Custom data-dependent recoding to logicals in R

I have two data frames, data and meta. Some, but not all, columns in data are logical values, but they are coded in many different ways. The rows in meta describe the columns in data, indicate whether they are to be interpreted as logicals, and if so, what single value codes TRUE and what single value codes FALSE.

I need a procedure that replaces all data values in conceptually logical columns with the appropriate logical values from the codes in the corresponding meta row. Any data values in a conceptually logical column that do not match a value in the corresponding meta row should become NA.

Small toy example for meta:

name                 type     false  true
-----------------------------------------
a.char.var           char     NA     NA
a.logical.var        logical  NA     7
another.logical.var  logical  1      0
another.char.var     char     NA     NA

Small toy example for data:

a.char.var  a.logical.var  another.logical.var  another.char.var
----------------------------------------------------------------
aa          7              0                    ba
ab          NA             1                    bb
ac          7              NA                   bc
ad          4              3                    bd

Small toy example output:

a.char.var  a.logical.var  another.logical.var  another.char.var
----------------------------------------------------------------
aa          TRUE           TRUE                 ba
ab          FALSE          FALSE                bb
ac          TRUE           NA                   bc
ad          NA             NA                   bd

I cannot, for the life of me, find a way to do this in idiomatic R that handles all the corner cases. The data sets are large, so an idiomatic solution would be ideal if possible. I inherited this absolutely insane data management mess and will be grateful to anybody who can help fix it. I am by no means an R guru, but this seems like a deceptively difficult problem.

Upvotes: 1

Views: 48

Answers (2)

ds440

Reputation: 891

First we set up the data:

meta <- data.frame(name=c('a.char.var', 'a.logical.var', 'another.logical.var', 'another.char.var'),
                   type=c('char', 'logical', 'logical', 'char'),
                   false=c(NA, NA, 1, NA),
                   true=c(NA, 7, 0, NA), stringsAsFactors = FALSE)

data <- data.frame(a.char.var=c('aa', 'ab', 'ac', 'ad'),
                   a.logical.var=c(7, NA, 7, 4),
                   another.logical.var=c(0,1,NA,3),
                   another.char.var=c('ba', 'bb', 'bc', 'bd'), stringsAsFactors = FALSE)

Then we subset meta to just the rows that describe logical columns. We iterate through these rows, using the name column to select the relevant column in data, and set the values in data_out, which start out as NA, to TRUE or FALSE wherever the values in data match the corresponding code.

Note that data[,logical_meta$name[1]] is equivalent to data[,'a.logical.var'] or data$a.logical.var, provided logical_meta$name is a character vector. If it is a factor (e.g. if we didn't specify stringsAsFactors=FALSE in a version of R before 4.0.0, where TRUE was the default), we need to convert it to character first, at which point we might as well give it a name: colname below.
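
A small sketch of why the conversion matters. The named vector v and the factor f are made-up illustration objects, not part of the data above. R's [ operator indexes by a factor's integer codes rather than by its labels, so the two lookups below can return different elements:

```r
# Hypothetical example objects, not part of the question's data
v <- c(b = 1, a = 2)
f <- factor(c("b", "a"))   # levels are sorted, so "a" has code 1 and "b" has code 2

by_factor <- v[f[1]]                # indexes by the integer code 2: the 2nd element
by_label  <- v[as.character(f[1])]  # indexes by the label "b"

names(by_factor)  # "a"
names(by_label)   # "b"
```

Wrapping the name in as.character(), as the loop below does, makes the lookup go by label in both cases.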

The NAs make which useful here: c(0, 1, NA, 3)==0 returns TRUE, FALSE, NA, FALSE, but which drops the NA and returns just the position 1. Subsetting by a logical vector that contains NAs yields NA rows or columns; wrapping the comparison in which avoids this.
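
To make the difference concrete, here is a minimal sketch using the same vector as in the sentence above (the variable names are only for illustration):

```r
x <- c(0, 1, NA, 3)

cmp <- x == 0          # TRUE FALSE NA FALSE
pos <- which(cmp)      # 1: the NA is dropped

with_na    <- x[cmp]   # 0 NA: the NA in the index produces an NA element
without_na <- x[pos]   # 0
```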

logical_meta <- meta[meta$type=='logical',]

data_out <- data #initialize


for(i in seq_len(nrow(logical_meta))) { #seq_len() also works when there are no logical columns
  colname <- as.character(logical_meta$name[i]) #only need as.character if factor
  data_out[,colname] <- NA
  #false column first
  if(is.na(logical_meta$false[i])) {
    data_out[is.na(data[,colname]),colname] <- FALSE
  } else {
    data_out[which(data[,colname]==logical_meta$false[i]),
             colname] <- FALSE
  }
  #true column next
  if(is.na(logical_meta$true[i])) {
    data_out[is.na(data[,colname]),colname] <- TRUE
  } else {
    data_out[which(data[,colname]==logical_meta$true[i]),
             colname] <- TRUE
  }
}

data_out
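
As a possible alternative to the explicit loop, here is a minimal vectorized sketch. It assumes the same meta and data frames as above (repeated so the block runs on its own) and that each logical column has exactly one TRUE code and one FALSE code. recode_logical is a hypothetical helper name, not an existing function. It relies on %in% treating NA as an ordinary value that matches NA, so NA codes need no separate branch.

```r
meta <- data.frame(name = c("a.char.var", "a.logical.var",
                            "another.logical.var", "another.char.var"),
                   type = c("char", "logical", "logical", "char"),
                   false = c(NA, NA, 1, NA),
                   true = c(NA, 7, 0, NA), stringsAsFactors = FALSE)

data <- data.frame(a.char.var = c("aa", "ab", "ac", "ad"),
                   a.logical.var = c(7, NA, 7, 4),
                   another.logical.var = c(0, 1, NA, 3),
                   another.char.var = c("ba", "bb", "bc", "bd"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: recode one vector given its TRUE and FALSE codes.
recode_logical <- function(x, true_code, false_code) {
  out <- rep(NA, length(x))          # anything unmatched stays NA
  out[x %in% false_code] <- FALSE    # %in% matches NA to NA, unlike ==
  out[x %in% true_code]  <- TRUE     # applied last, as in the loop above
  out
}

is_lgl <- meta$type == "logical"
cols   <- as.character(meta$name[is_lgl])

data_out2 <- data
# Map() walks the selected columns and their codes in parallel.
data_out2[cols] <- Map(recode_logical,
                       data[cols],
                       meta$true[is_lgl],
                       meta$false[is_lgl])
data_out2
```

On the toy data this should give the same result as the loop; it has not been benchmarked on large inputs.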

Upvotes: 1

bouncyball

Reputation: 10771

I've written a function that takes the index of a column in data and performs the operation you described.

The function first selects x, the column we are interested in. We then match the name of that column to the entries in the first column of meta, which gives the row of interest.

We then check whether the column type is logical. If it isn't, we return x unchanged. If it is, we compare its values against the true and false codes in meta: matches become TRUE or FALSE, and everything else becomes NA.

convert_data <- function(colindex, dat, meta){
    x <- dat[,colindex] #select our data vector

    #match the column name to the first column in meta
    find_in_meta <- match(names(dat)[colindex],
                          meta[,1])

    #what type of column is it
    type_col <- meta[find_in_meta,2]

    #columns that are missing from meta or are not logical are returned unchanged
    if(is.na(type_col) || type_col != 'logical'){
        return(x)
    }else{
        #an NA code for TRUE is swapped for the placeholder 'NA_val' so == can match it
        true_val <- ifelse(is.na(meta[find_in_meta,4]),'NA_val',
                           meta[find_in_meta,4])

        #same placeholder for an NA code for FALSE
        false_val <- ifelse(is.na(meta[find_in_meta,3]), 'NA_val',
                            meta[find_in_meta, 3])

        #NAs in the data get the same placeholder (x becomes character if it had any NA)
        x <- ifelse(is.na(x), 'NA_val', x)
        #values that match neither code become NA
        x <- ifelse(x == true_val, TRUE,
               ifelse(x == false_val, FALSE, NA))
        return(x)
    }
}

We can then use lapply and a little data manipulation to get it into an acceptable form:

res <- lapply(seq_along(df1), function(ind)
                      convert_data(colindex = ind, dat = df1, meta = meta))

setNames(do.call('cbind.data.frame', res), names(df1))

  a.char.var a.logical.var another.logical.var another.char.var
1         aa          TRUE                TRUE               ba
2         ab         FALSE               FALSE               bb
3         ac          TRUE                  NA               bc
4         ad            NA                  NA               bd

Data used in this answer (the question's data frame is named df1 here):

meta <- structure(list(name = c("a.char.var", "a.logical.var", "another.logical.var", 
"another.char.var"), type = c("char", "logical", "logical", "char"
), false = c(NA, NA, 1L, NA), true = c(NA, 7L, 0L, NA)), .Names = c("name", 
"type", "false", "true"), class = "data.frame", row.names = c(NA, 
-4L))

df1 <- structure(list(a.char.var = c("aa", "ab", "ac", "ad"), a.logical.var = c(7L, 
NA, 7L, 4L), another.logical.var = c(0L, 1L, NA, 3L), another.char.var = c("ba", 
"bb", "bc", "bd")), .Names = c("a.char.var", "a.logical.var", 
"another.logical.var", "another.char.var"), class = "data.frame", row.names = c(NA, 
-4L))

Upvotes: 0
