I guess I am writing in the spirit of "no question is too easy". I am just an average Stata-using social scientist approaching R for the first time and spending endless nights on it... Please have mercy!
I am working with a comparative dataset from 20 countries (about 20,000 observations, quite balanced across countries).
I have to perform a set of quite computationally intensive MCMC simulations, so I decided to split the df into a list containing 20 country-specific data frames and proceed with lapply(). (I read that it is more efficient to avoid for loops in R, right?)
My most immediate problem is that I am unable to preprocess the elements within the various data frames contained in the list. In particular, I have to recode a set of 15 variables. These are integers ranging from 0 to 10, plus the typical SPSS codes for missing cases: 77, 88, 89, 99, 999. I want to recode these values to NA and then do some small additional transformations: center on 0 and define two objects, T and TT, with two different sets of variables to be used later in the simulations. This task has to be repeated across the 20 country-specific list elements that compose the "master" list "ees2009split".
ees2009split <- vector("list", 20)
ees2009split <- split(ees2009, ees2009$t102) # t102 is the country identifier
names(ees2009split) <- country.names[1:20] # rename list objects with country names
So here is my list (sorry I am unable to provide a reproducible example):
> str(ees2009split)
List of 20
$ Austria :'data.frame': 1000 obs. of 17 variables:
..$ t102 : int [1:1000] 1040 1040 1040 1040 1040 1040 1040 1040 1040 1040 ...
..$ q46 : int [1:1000] 77 2 5 5 5 77 5 5 5 77 ...
..$ q47_p1 : int [1:1000] 77 3 5 4 77 77 5 1 89 77 ...
..$ q47_p2 : int [1:1000] 77 8 7 6 77 77 5 6 5 77 ...
..$ q47_p3 : int [1:1000] 77 10 10 9 77 77 5 7 7 77 ...
..$ q47_p4 : int [1:1000] 77 10 9 8 77 77 5 7 4 77 ...
..$ q47_p5 : int [1:1000] 77 2 5 3 77 77 5 1 3 77 ...
..$ q47_p6 : int [1:1000] 77 4 89 5 77 77 89 2 89 77 ...
..$ q47_p7 : int [1:1000] 77 3 89 6 77 77 89 3 5 77 ...
..$ q47_p8 : int [1:1000] 77 1 0 0 77 77 5 0 89 77 ...
..$ q47_p9 : int [1:1000] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p10: int [1:1000] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p11: int [1:1000] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p12: int [1:1000] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p13: int [1:1000] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p14: int [1:1000] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p15: int [1:1000] 99 99 99 99 99 99 99 99 99 99 ...
$ Belgium :'data.frame': 1002 obs. of 17 variables:
..$ t102 : int [1:1002] 1056 1056 1056 1056 1056 1056 1056 1056 1056 1056 ...
..$ q46 : int [1:1002] 5 0 77 88 77 88 5 2 77 5 ...
..$ q47_p1 : int [1:1002] 88 5 77 77 6 77 5 77 5 77 ...
..$ q47_p2 : int [1:1002] 88 10 77 77 8 77 89 77 10 77 ...
..$ q47_p3 : int [1:1002] 88 7 77 77 5 77 3 77 0 77 ...
..$ q47_p4 : int [1:1002] 88 10 77 77 10 77 10 77 10 77 ...
..$ q47_p5 : int [1:1002] 88 0 77 77 4 77 4 77 5 77 ...
..$ q47_p6 : int [1:1002] 99 99 77 99 99 77 99 77 99 99 ...
..$ q47_p7 : int [1:1002] 99 99 77 99 99 77 99 77 99 99 ...
..$ q47_p8 : int [1:1002] 99 99 88 99 99 77 99 77 99 99 ...
..$ q47_p9 : int [1:1002] 99 99 77 99 99 77 99 77 99 99 ...
..$ q47_p10: int [1:1002] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p11: int [1:1002] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p12: int [1:1002] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p13: int [1:1002] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p14: int [1:1002] 99 99 99 99 99 99 99 99 99 99 ...
..$ q47_p15: int [1:1002] 99 99 99 99 99 99 99 99 99 99 ...
etc... until country 20.
I defined two functions to be called with lapply(), renaming() and recoding():
renaming <- function(x) {
  # rename the self-placement (q46) and party-placement (q47_p*) variables
  names(x) <- gsub("q46", "lr.self", names(x))
  names(x) <- gsub("q47_p", "lr.p", names(x))
  return(x)
}
So far so good:
> processed.dat <- lapply(ees2009split, renaming)
> str(processed.dat)
List of 20
$ Austria :'data.frame': 1000 obs. of 17 variables:
..$ t102 : int [1:1000] 1040 1040 1040 1040 1040 1040 1040 1040 1040 1040 ...
..$ lr.self: int [1:1000] 77 2 5 5 5 77 5 5 5 77 ...
..$ lr.p1 : int [1:1000] 77 3 5 4 77 77 5 1 89 77 ...
# I omit the rest...
With the recoding function, however, I am having a hard time:
recoding <- function(x){
# recode missing values
x$lr.self[lr.self %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p1[lr.p1 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p2[lr.p2 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p3[lr.p3 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p4[lr.p4 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p5[lr.p5 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p6[lr.p6 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p7[lr.p7 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p8[lr.p8 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p9[lr.p9 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p10[lr.p10 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p11[lr.p11 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p12[lr.p12 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p13[lr.p13 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p14[lr.p14 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$lr.p15[lr.p15 %in% c(77, 88, 89, 98, 99, 999)] <- NA
x$T <- cbind(lr.self, lr.p1, lr.p2, lr.p3, lr.p4, lr.p5, lr.p6, lr.p7, lr.p8, lr.p9, lr.p10, lr.p11, lr.p12, lr.p13, lr.p14, lr.p15)
T <- T - 5 # centering on 0
lrself.resc <- T[,1] # rescaled lr.self
TT <- T[,-1] # whole matrix rescaled
N <- nrow(TT)
q <- ncol(TT)
z <- TT
x$dat.list <- list(lr.self=lr.self, lr.p1=lr.p1, lr.p2=lr.p2, lr.p3=lr.p3, lr.p4=lr.p4, lr.p5=lr.p5, lr.p6=lr.p6, lr.p7=lr.p7, lr.p8=lr.p8, lr.p9=lr.p9, lr.p10=lr.p10, lr.p11=lr.p11, lr.p12=lr.p12, lr.p13=lr.p13, lr.p14=lr.p14, lr.p15=lr.p15, T=T, TT=TT, lrself.resc, N=N, q=q, z=z)
return(x$dat.list)
}
This is the output:
> processed.dat <- lapply(ees2009split, recoding)
Error in match(x, table, nomatch = 0L) : object 'lr.self' not found
Called from: FUN(X[[1L]], ...)
Browse[1]>
1) How should I recode variables within a data frame that is contained in a list with lapply()? More broadly, how do I refer to objects inside the country df from within the function?
2) More generally, is this way of proceeding correct? Splitting, defining task-specific functions, calling them with lapply(), and finally recombining?
Thank you for any suggestion or comment. Andrea
This should do it for the data cleaning. I use the gdata package, which you'll probably have to install with install.packages('gdata'). In it you will find a most useful function, namely unknownToNA(). See the example below.
As I said, I prefer to do the cleaning before I split up the data. I took the liberty of using the EES 2009 dataset as well:
library(foreign)
library(gdata)
#setwd("/Data/sample")
#list.files()
mydata <- read.dta("ZA5055_v1-1-0.dta")
keepvars <- grep("^q46|^q47|^t102", names(mydata), value=TRUE)
mydata2 <- subset(mydata, select=keepvars)
rm(mydata)
str(mydata2)
head(mydata2)
naval <- c(77, 88, 89, 99, 999)
mydata3 <- unknownToNA(mydata2, unknown=list(.default=naval))
head(mydata3)
# t102 q46 q47_p1 q47_p2 q47_p3 q47_p4 q47_p5 q47_p6 q47_p7 q47_p8 q47_p9 q47_p10 q47_p11 q47_p12 q47_p13
# 1 Austria NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 2 Austria 2 3 8 10 10 2 4 3 1 NA NA NA NA NA
# 3 Austria 5 5 7 10 9 5 NA NA 0 NA NA NA NA NA
# 4 Austria 5 4 6 9 8 3 5 6 0 NA NA NA NA NA
# 5 Austria 5 NA NA NA NA NA NA NA NA NA NA NA NA NA
# 6 Austria NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# q47_p14 q47_p15
# 1 NA NA
# 2 NA NA
# 3 NA NA
# 4 NA NA
# 5 NA NA
# 6 NA NA
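If you would rather not depend on gdata, the same recoding can be sketched in base R. This is only an illustration on a made-up toy data frame (toy and its values are not from the EES file); the idea is to overwrite a set of columns with the result of lapply() over those columns.

```r
# Toy data standing in for mydata2 (values are made up for illustration)
toy <- data.frame(
  t102   = c("Austria", "Austria", "Belgium"),
  q46    = c(77L, 2L, 5L),
  q47_p1 = c(3L, 89L, 999L)
)
naval <- c(77, 88, 89, 99, 999)

# Columns to recode: everything except the country identifier
vars <- setdiff(names(toy), "t102")

# Replace each missing-value code with NA, column by column
toy[vars] <- lapply(toy[vars], function(v) replace(v, v %in% naval, NA))
toy
```

Because the assignment goes to toy[vars], the object stays a data frame and the untouched t102 column is preserved.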
If you prefer to split first for some reason, here you go:
library(gdata)
ees2009split <- split(mydata2, mydata2$t102)
ees2009split <- unknownToNA(ees2009split, unknown=list(.default=list(naval)))
head(ees2009split[[1]])
t102 q46 q47_p1 q47_p2 q47_p3 q47_p4 q47_p5 q47_p6 q47_p7 q47_p8 q47_p9 q47_p10 q47_p11 q47_p12 q47_p13
1 Austria NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 Austria 2 3 8 10 10 2 4 3 1 NA NA NA NA NA
3 Austria 5 5 7 10 9 5 NA NA 0 NA NA NA NA NA
4 Austria 5 4 6 9 8 3 5 6 0 NA NA NA NA NA
5 Austria 5 NA NA NA NA NA NA NA NA NA NA NA NA NA
6 Austria NA NA NA NA NA NA NA NA NA NA NA NA NA NA
q47_p14 q47_p15
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
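As for the error in the question: inside recoding(), a bare name such as lr.self is looked up as a free-standing object, not as a column of x, which is why R reports "object 'lr.self' not found". A minimal sketch of a function that reaches the columns through x instead is given here. It assumes the columns have already been renamed to lr.self, lr.p1, ... as in the question; recode_missing and toy_split are hypothetical names with made-up values.

```r
naval <- c(77, 88, 89, 98, 99, 999)  # codes used in the question's recoding()

# Hypothetical helper: recode every lr.* column of one country data frame
recode_missing <- function(x, codes = naval) {
  vars <- grep("^lr\\.", names(x), value = TRUE)
  # x[vars] refers to columns of x; a bare lr.self would not be found
  x[vars] <- lapply(x[vars], function(v) replace(v, v %in% codes, NA))
  x
}

# Toy list standing in for the renamed, split data (made-up values)
toy_split <- list(
  Austria = data.frame(t102 = 1040L, lr.self = c(77L, 2L), lr.p1 = c(3L, 89L)),
  Belgium = data.frame(t102 = 1056L, lr.self = c(5L, 88L), lr.p1 = c(99L, 10L))
)

# Apply the helper to every country-specific data frame
processed <- lapply(toy_split, recode_missing)
str(processed)
```

The function returns the whole data frame, so the result is again a named list of data frames that can be recombined later with do.call(rbind, processed).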
I'm afraid I don't understand your next steps enough to help further.
But generally for scaling I use the scale() function, which by default centers each column on its mean (so the new mean is 0) and divides by the column's standard deviation:
head(scale(mydata3[,-1]))
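If the goal is the question's T - 5, that is, centering on the midpoint of the 0-10 scale, not on the sample mean, plain subtraction is enough. The sketch below contrasts the two on made-up values; Tmat and TTmat are hypothetical names standing in for the question's T and TT (renamed here so that T does not mask the shorthand for TRUE).

```r
# Made-up recoded values on the 0-10 scale
toy <- data.frame(lr.self = c(2, 5, NA, 8), lr.p1 = c(3, NA, 4, 10))

Tmat  <- as.matrix(toy) - 5        # center on the scale midpoint
TTmat <- Tmat[, -1, drop = FALSE]  # party placements only, kept as a matrix

# Mean-centering only (no division by the standard deviation)
centered <- scale(toy, center = TRUE, scale = FALSE)
colMeans(centered, na.rm = TRUE)
```

With scale = FALSE the values keep their original units, which may matter if the simulations expect placements on the original 11-point metric.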