Peter King
Peter King

Reputation: 101

In R how do you factorise and add label values to specific data.table columns, using a second file of meta data?

This is part of a project to switch from SPSS to R. While there are good tools to import SPSS files into R (expss) what this question is part of is attempting to get the benefits of SPSS style labeling when data originates from CSV sources. This is to help bridge the staff training gap between SPSS and R by providing a common format for data.tables irrespective of file format origin.

Whilst CSV does a reasonable job of storing data it is hopeless for providing meaningful data. This inevitably means variable and factor levels and labels have to come from somewhere else. In most short examples of this (e.g. in documentation) it is practical to simply hard code the meta data in. But for larger projects it makes more sense to store this meta data in a second csv file.

Example data file

ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten 1,1,34,1,,1,,1,1,4, 2,1,21,0,1,,1,3,14,3,2 3,1,54,1,,,1,3,6,4,4 4,2,32,1,1,1,,3,7,4, 5,3,66,0,,,1,3,9,3,3 6,2,43,1,,1,,1,12,2,1 7,2,26,0,,,1,2,11,1, 8,3,,1,1,,,2,15,1,4 9,1,34,1,,1,,1,12,3,4 10,2,46,0,,,,3,13,2, 11,3,39,1,1,1,,3,7,1,2 12,1,28,0,,,1,1,6,5,1 13,2,64,0,,1,,2,11,,3 14,3,34,1,1,,,3,10,1,1 15,1,52,1,,1,1,1,8,6,

Example metadata file

Rowlabels,ID,varone,vartwo,varthree,varfour,varfive,varsix,varseven,vareight,varnine,varten varlabel,,Question one,Question two,Question three,Question four,Question five,Question six,Question seven,Question eight,Question nine,Question ten varrole,Unique,Attitude,Unique,Filter,Filter,Filter,Filter,Attitude,Filter,Attitude,Attitude Missing,Error,Error,Ignored,Error,Unchecked,Unchecked,Unchecked,Error,Error,Error,Ignored vallable,,One,,No,Checked,Checked,Checked,x,One,A,Support vallable,,Two,,Yes,,,,y,Two,B,Neutral vallable,,Three,,,,,,z,Three,C,Oppose vallable,,,,,,,,,Four,D,Dont know vallable,,,,,,,,,Five,E, vallable,,,,,,,,,Six,F, vallable,,,,,,,,,Seven,G, vallable,,,,,,,,,Eight,, vallable,,,,,,,,,Nine,, vallable,,,,,,,,,Ten,, vallable,,,,,,,,,Eleven,, vallable,,,,,,,,,Twelve,, vallable,,,,,,,,,Thirteen,, vallable,,,,,,,,,Fourteen,, vallable,,,,,,,,,Fifteen,,

SO the common elements are the column names which are the key to both files

The first column of the metadata file describes the role of the row for the data file so varlabel provides the variable label for each column varrole describes the analytic purpose of the variable missing describes how to treat missing data varlabel describes the label for a factor level starting at one on up to as many labels as there are.

Right! Here's the code that works:

```#Libraries
library(expss)
library(data.table)
library(magrittr)```
readcsvdata <- function(dfile)
{
# TESTED - Working
print("OK Lets read some comma separated values")
rdata <- fread(file = dfile, sep = "," , quote = "\"" , header = TRUE, stringsAsFactors = FALSE, 
na.strings = getOption("datatable.na.strings",""))
return(rdata)
}
rawdatafilename <- "testdata.csv"
rawmetadata <- "metadata.csv"

mdt <- readcsvdata(rawmetadata)
rdt <- readcsvdata(rawdatafilename)
names(rdt)[names(rdt) == "ï..ID"] <- "ID" # correct minor data error
commonnames <- intersect(names(mdt),names(rdt))  # find common variable names so metadata applies 
commonnames <- commonnames[-(1)] # remove ID
qlabels <- as.list(mdt[1, commonnames, with = FALSE])

(Here I copy the rdt datatable simply so I can roll back to the original data without re-running the previous read chunks and tidying whenever I make changes that don't work out.

# set var names to columns
for (each_name in commonnames) # loop through commonnames and qlabels
{  
    expss::var_lab(tdt[[each_name]]) <- qlabels[[each_name]]
}

OK this is where I fall down. Failure from here

factorcols <- as.vector(commonnames)  # create a vector of column names (for later use)
for (col in factorcols) 
  { 
  print( is.na(mdt[4, ..col])) # print first row of value labels (as test)
  if (is.na(mdt[4, ..col])) factorcols <- factorcols[factorcols != col] 
# if not a factor column, remove it from the factorcol list and dont try to factor it
  else {  # if it is a vector factorise
    print(paste("working on",col))  # I have had a lot of problem with unrecognised ..col variables
    tlabels <- as.vector(na.omit(mdt[4:18, ..col]))  # get list of labels from the data column}
    validrange <- seq(1,lengths(tlabels),1)            # range of valid values is 1 to the length of labels list
    print(as.character(tlabels)) # for testing
    print(validrange) # for testing
    tdt[[col]] <- factor(tdt[[col]], levels = validrange, ordered = is.ordered(validrange), labels = as.character(tlabels))
    # expss::val_lab(tdt[, ..col]) <- tlabels
    tlabels = c()  # flush loop variable
    validrange = c() # flush loop variable
  }
}

So the problem is revealed here when we check the data table.

tdt

the labels have been applied as whole vectors to each column entry except where there is only one value in the vector ("checked" for varfour and varfive)
tdt
id (int) 1
varone (fctr) c("One", "Two", "Three") 1 (should be "One" 1)
vartwo (S3: labelled) 34
varthree (fctr) c("No", "Yes") 1 (should be "No" 1)
varfour (fctr) NA
varfive (fctr) Checked

And a mystery
this code works just fine on a single columns when I don't use a for loop variable

# test using column name
tlabels <- c("one","two","three")
validrange <- c(1,2,3)
factor(tdt[,varone], levels = validrange, ordered=is.ordered(validrange), labels = tlabels)

Upvotes: 1

Views: 436

Answers (1)

Gregory Demin
Gregory Demin

Reputation: 4846

It seems the issue is in the line tlabels <- as.vector(na.omit(mdt[4:18, ..col])). It doesn't make vector as you expect. Contrary to usual data.frame data.table doesn't drop dimensions when you provide single column in the index. And as.vector do nothing with data.frames/data.tables. So tlabels remains data.table. This line need to be rewritten as tlabels <- na.omit(mdt[[col]][4:18]). Example:

library(data.table)
mdt = as.data.table(mtcars)
col = "am"
tlabels <- as.vector(na.omit(mdt[3:6, ..col])) # ! tlabels is data.table 
str(tlabels)
# Classes ‘data.table’ and 'data.frame':    4 obs. of  1 variable:
#     $ am: num  1 0 0 0
# - attr(*, ".internal.selfref")=<externalptr> 

as.character(tlabels) # character vector of length 1
# [1] "c(1, 0, 0, 0)"

tlabels <- na.omit(mdt[[col]][3:6]) # vector 
str(tlabels)
# num [1:4] 1 0 0 0
as.character(tlabels) # character vector of length 4
# [1] "1" "0" "0" "0"

Upvotes: 1

Related Questions