Reputation: 329
I have two data frames, data and meta. Some, but not all, columns in data are logical values, but they are coded in many different ways. The rows in meta describe the columns in data, indicate whether they are to be interpreted as logicals, and if so, what single value codes TRUE and what single value codes FALSE.
I need a procedure that replaces all data values in conceptually logical columns with the appropriate logical values from the codes in the corresponding meta row. Any data values in a conceptually logical column that do not match a value in the corresponding meta row should become NA.
Small toy example for meta:
name                 type     false  true
-----------------------------------------
a.char.var           char     NA     NA
a.logical.var        logical  NA     7
another.logical.var  logical  1      0
another.char.var     char     NA     NA
Small toy example for data:
a.char.var  a.logical.var  another.logical.var  another.char.var
----------------------------------------------------------------
aa          7              0                    ba
ab          NA             1                    bb
ac          7              NA                   bc
ad          4              3                    bd
Small toy example output:
a.char.var  a.logical.var  another.logical.var  another.char.var
----------------------------------------------------------------
aa          TRUE           TRUE                 ba
ab          FALSE          FALSE                bb
ac          TRUE           NA                   bc
ad          NA             NA                   bd
I cannot, for the life of me, find a way to do this in idiomatic R that handles all the corner cases. The data sets are large, so an idiomatic solution would be ideal if possible. I inherited this absolutely insane data management mess and will be grateful to anybody who can help fix it. I am by no means an R guru, but this seems like a deceptively difficult problem.
Upvotes: 1
Views: 48
Reputation: 891
First we set up the data:
meta <- data.frame(name = c('a.char.var', 'a.logical.var', 'another.logical.var', 'another.char.var'),
                   type = c('char', 'logical', 'logical', 'char'),
                   false = c(NA, NA, 1, NA),
                   true = c(NA, 7, 0, NA), stringsAsFactors = FALSE)
data <- data.frame(a.char.var = c('aa', 'ab', 'ac', 'ad'),
                   a.logical.var = c(7, NA, 7, 4),
                   another.logical.var = c(0, 1, NA, 3),
                   another.char.var = c('ba', 'bb', 'bc', 'bd'), stringsAsFactors = FALSE)
Then we subset meta to just the rows that describe logical columns. We will iterate through these, using the name column to select the relevant column in data, and change values in data_out from an initialized NA to either TRUE or FALSE according to matching values in data.
Note that data[,logical_meta$name[1]] is equivalent to data[,'a.logical.var'] or data$a.logical.var, if logical_meta$name is a character vector. If it is a factor (e.g. if we didn't specify stringsAsFactors=F), we need to convert it to character, at which point we might as well give it a name: colname below.
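As a quick check of that equivalence, here is a minimal sketch. It uses cut-down, hypothetical toy_data and toy_meta objects (not the full data above) and compares the three ways of selecting the column.

```r
# Hypothetical cut-down versions of data and meta, for illustration only
toy_data <- data.frame(a.logical.var = c(7, NA, 7, 4),
                       another.logical.var = c(0, 1, NA, 3))
toy_meta <- data.frame(name = c('a.logical.var', 'another.logical.var'),
                       stringsAsFactors = FALSE)

by_meta   <- toy_data[, toy_meta$name[1]]   # name looked up in meta
by_string <- toy_data[, 'a.logical.var']    # name typed as a string
by_dollar <- toy_data$a.logical.var         # $ access

identical(by_meta, by_string)  # TRUE
identical(by_meta, by_dollar)  # TRUE

# If the name were stored as a factor, as.character() recovers the string
as.character(factor('a.logical.var'))  # "a.logical.var"
```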
Having NAs to contend with means using which is advantageous: c(0, 1, NA, 3)==0 returns T, F, NA, F, but which then ignores the NA and returns just the position 1. Subsetting by a logical vector that includes NAs yields NA rows or columns; using which eliminates this.
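The behaviour described above can be seen on a plain vector. This is only an illustrative sketch; x, hits, idx and out are names made up for the example.

```r
x <- c(0, 1, NA, 3)

hits <- x == 0       # TRUE FALSE NA FALSE: the NA propagates through ==
idx  <- which(hits)  # 1: which() keeps only the positions that are TRUE

x[hits]              # 0 NA: the NA in the index produces an NA element
x[idx]               # 0:    the which() version does not

# Start from all NA and fill in only the matched positions
out <- rep(NA, length(x))
out[idx] <- TRUE
out                  # TRUE NA NA NA
```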
logical_meta <- meta[meta$type == 'logical', ]
data_out <- data  # initialize
# seq_len() rather than 1:nrow() so the loop is skipped if there are no logical columns
for (i in seq_len(nrow(logical_meta))) {
  colname <- as.character(logical_meta$name[i])  # only need as.character if factor
  data_out[, colname] <- NA
  # false column first; an NA code means missing values in data code FALSE
  if (is.na(logical_meta$false[i])) {
    data_out[is.na(data[, colname]), colname] <- FALSE
  } else {
    data_out[which(data[, colname] == logical_meta$false[i]),
             colname] <- FALSE
  }
  # true column next; an NA code means missing values in data code TRUE
  if (is.na(logical_meta$true[i])) {
    data_out[is.na(data[, colname]), colname] <- TRUE
  } else {
    data_out[which(data[, colname] == logical_meta$true[i]),
             colname] <- TRUE
  }
}
data_out
Upvotes: 1
Reputation: 10771
I've written a function that takes in the column index of data and tries to perform the operation you described.
The function first selects x as the column we are interested in. We then match the name of the column in data to the entries in the first column of meta; this gives our row of interest.
We then check if the column type is logical. If it isn't, we just return x, since nothing needs to be changed. If the column type is logical, we then check whether its values match the true or false columns in meta.
# meta has no default: `meta = meta` would refer to itself and fail when omitted
convert_data <- function(colindex, dat, meta) {
  x <- dat[, colindex]  # select our data vector
  # match the column name to the first column in meta
  find_in_meta <- match(names(dat)[colindex],
                        meta[, 1])
  # what type of column is it
  type_col <- meta[find_in_meta, 2]
  # columns that are missing from meta, or are not logical, are returned unchanged
  if (is.na(type_col) || type_col != 'logical') {
    return(x)
  } else {
    # replace an NA true code with a placeholder so it can be compared with ==
    true_val <- ifelse(is.na(meta[find_in_meta, 4]), 'NA_val',
                       meta[find_in_meta, 4])
    # same for an NA false code
    false_val <- ifelse(is.na(meta[find_in_meta, 3]), 'NA_val',
                        meta[find_in_meta, 3])
    # same for NA values in the data
    x <- ifelse(is.na(x), 'NA_val', x)
    x <- ifelse(x == true_val, TRUE,
                ifelse(x == false_val, FALSE, NA))
    return(x)
  }
}
We can then use lapply and a little data manipulation to get it into an acceptable form:
res <- lapply(seq_len(ncol(df1)), function(ind)
  convert_data(colindex = ind, dat = df1, meta = meta))
setNames(do.call('cbind.data.frame', res), names(df1))
  a.char.var a.logical.var another.logical.var another.char.var
1         aa          TRUE                TRUE               ba
2         ab         FALSE               FALSE               bb
3         ac          TRUE                  NA               bc
4         ad            NA                  NA               bd
Data used above:
meta <- structure(list(name = c("a.char.var", "a.logical.var", "another.logical.var",
                                "another.char.var"),
                       type = c("char", "logical", "logical", "char"),
                       false = c(NA, NA, 1L, NA),
                       true = c(NA, 7L, 0L, NA)),
                  .Names = c("name", "type", "false", "true"),
                  class = "data.frame", row.names = c(NA, -4L))
df1 <- structure(list(a.char.var = c("aa", "ab", "ac", "ad"),
                      a.logical.var = c(7L, NA, 7L, 4L),
                      another.logical.var = c(0L, 1L, NA, 3L),
                      another.char.var = c("ba", "bb", "bc", "bd")),
                 .Names = c("a.char.var", "a.logical.var",
                            "another.logical.var", "another.char.var"),
                 class = "data.frame", row.names = c(NA, -4L))
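The 'NA_val' placeholder in convert_data is needed because comparing with == against NA gives NA rather than TRUE. As a possible alternative (a sketch only, not part of the function above), match() treats NA as an ordinary value, so one lookup handles both codes. The helper name recode_by_match and its arguments are hypothetical.

```r
# Hypothetical helper: recode one vector given its FALSE code and TRUE code
recode_by_match <- function(x, false_code, true_code) {
  # match() gives 1 where x equals false_code, 2 where it equals true_code,
  # and NA where it equals neither; an NA code matches NA values in x
  c(FALSE, TRUE)[match(x, c(false_code, true_code))]
}

recode_by_match(c(7, NA, 7, 4), false_code = NA, true_code = 7)
# TRUE FALSE TRUE NA
recode_by_match(c(0, 1, NA, 3), false_code = 1, true_code = 0)
# TRUE FALSE NA NA
```

One caveat with this sketch: if both codes are NA, match() returns the first hit, so missing values would all become FALSE.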
Upvotes: 0