Reputation: 365
I have a dataset with continuous variables and two categorical ID variables. I would like to scale my data such that the continuous variables have the value of one for a a certain value of each ID. I try to implement it as a function in R so that I can avoid repeated coding.
Moreover, I have an example of this rescaling implemented in Stata:
gen value_var_i_k= value_var if ID1=="15t16" & ID2 =="AUS"
egen value_var_i_k_m = mean(value_var_i_k)
drop value_var_i_k
rename value_var_i_k_m value_var_i_k
gen value_var_k= value_var if ID1 =="15t16"
bys ID2: egen value_var_k_m = mean(value_var_k)
drop value_var_k
rename value_var_k_m value_var_k
gen value_var_i = value_var if ID2=="AUS"
bys ID1: egen value_var_i_m = mean(value_var_i)
drop value_var_i
rename value_var_i_m value_var_i
gen value_var_i_k_norm= value_var * value_var_i_k/(value_var_i*value_var_k)
My attempt in R to create the value_var_i value_var_k
and value_var_i_k
variables with new variables, which exits with an error (not applicable method for class "character"):
library(dplyr)
library(magrittr)
normalize<-function(var,data,i,k) {
varname <- paste("value", var , sep="_")
#Id columns and define variables to select
col1<-"ID1"
col2<-"ID2"
select_variables<-c(col2,col1,varname)
#name of the output variables
name_ik<-paste(i,k,sep="_")
name.ik<-paste(name_ik, "df", sep="_")
name.i<-paste(i, "df", sep="_")
name.k<-paste(k, "df", sep="_")
#my attempt to replicate the Stata code with dplyr
data %>% filter_(as.name(col1)==as.name(i) & as.name(col2)==as.name(k)) %>% select_( .dots=select_variables) %$% as.name( name.ik)
data %>% filter_(as.name(col1)==as.name(i)) %>% select_( .dots = select_variables ) %>% group_by_(as.name(col2)) %>%transform( interp(~mean(b, na.rm = TRUE),b=as.name(varname) )) %$% as.name(name.i)
data %>% filter_(as.name(col2)==as.name(k)) %>% select_( .dots = select_variables ) %>% group_by_(as.name(col1)) %>%transform( interp(~mean(b, na.rm = TRUE),b=as.name(varname) )) %$% as.name(name.k)
norm <- data[eval(substitute(varname)]*as.name(name.ik)/ ( as.name(name.i) * as.name(name.k) )
}
Update II: A minimal working example with the computation steps:
The variables value_var
are the same in both tables. The Stata code replaces the value_var_k
and value_var_i
variables with the variables value_var_i_m
and value_var_k_m
.
Then value_var
is transformed.
value_var ID1 ID2 value_var_i_k value_var_k value_var_k_m
1.154662 15t16 AUS 1.154662 1.154662 1.154662
1.070471 17t18 AUS 1.154662 . 1.154662
0.9643197 19 AUS 1.154662 . 1.154662
1.036398 20 AUS 1.154662 . 1.154662
1.084701 21t22 AUS 1.154662 . 1.154662
1.463215 15t16 AUT 1.154662 1.463215 1.463215
1.431824 17t18 AUT 1.154662 . 1.463215
1.276983 19 AUT 1.154662 . 1.463215
1.441925 20 AUT 1.154662 . 1.463215
1.506117 21t22 AUT 1.154662 . 1.463215
1.589491 15t16 BEL 1.154662 1.589491 1.589491
1.540076 17t18 BEL 1.154662 . 1.589491
1.188218 19 BEL 1.154662 . 1.589491
1.386074 20 BEL 1.154662 . 1.589491
1.48204 21t22 BEL 1.154662 . 1.589491
value_var ID1 ID2 value_var_i value_var_i_m
1.154662 15t16 AUS 1.154662 1.154662
1.589491 15t16 BEL . 1.154662
1.463215 15t16 AUT . 1.154662
1.070471 17t18 AUS 1.070471 1.070471
1.540076 17t18 BEL . 1.070471
1.431824 17t18 AUT . 1.070471
0.9643197 19 AUS 0.9643197 0.9643197
1.276983 19 AUT . 0.9643197
1.188218 19 BEL . 0.9643197
1.036398 20 AUS 1.036398 1.036398
1.441925 20 AUT . 1.036398
1.386074 20 BEL . 1.036398
1.084701 21t22 AUS 1.084701 1.084701
1.506117 21t22 AUT . 1.084701
1.48204 21t22 BEL . 1.084701
The normalized value var is as follows (normaliziation for ID1 level "15t16" and ID2 level "AUS" ) :
ID1 ID2 value_var_i_k_norm
AUS 15t16 1
AUS 17t18 1
AUS 19 1
AUS 20 1
AUS 21t22 1
AUT 15t16 1
AUT 17t18 1.055508
AUT 19 1.044988
AUT 20 1.097901
AUT 21t22 1.09571
BEL 15t16 1
BEL 17t18 1.045116
BEL 19 .8951011
BEL 20 .9715319
BEL 21t22 .9925373
Update: To make the normalization (or scaling) steps more clear I show here the pre after normalized data in a wide format.
First I start from the following wide data
Row-/Colnames 15t16 17t18 19t 20t 21t22
AUS 1.154662 1.070471 0.9643197 1.036398 1.084701
AUT 1.463215 1.431824 1.276983 1.441925 1.506117
BEL 1.589491 1.540076 1.188218 1.386074 1.48204
I normalize the matrix to the row with the values for AUS and the column with the values of 15t16. So, I would obtain
Row-/Colnames 15t16 17t18 19t 20t 21t22
AUS 1 1 1 1 1
AUT 1 1.055508 1.044988 1.097901 1.09571
BEL 1 1.045116 .8951011 .9715319 .9925373
Upvotes: 1
Views: 1505
Reputation: 365
I now see that my attempt to translate the Stata code led me astray in R. The normalization variables are simply the results of filtering the original data frame with respect to the specific ID value and in the next step I inner join them with the original data frame.
normalize<-function(data,var,col1,col2,i,k) {
varname <- paste("z.ik", var , sep="_")
var.i<-paste(var,i, sep="_")
var.k<-paste(var,k, sep="_")
var.ik<-paste(var.i,k, sep="_")
varname_i <- paste("z", var.i , sep="_")
varname_k <- paste("z", var.k , sep="_")
varname_ik <- paste("z", var.ik , sep="_")
d2<-select_(data, varname, col1, col2)
filter_crit2 = interp(~ filter_var1 %in% i & filter_var2 %in% k ,.values = list(filter_var1 = as.name(col1),filter_var2 = as.name(col2)))
filter_crit_k = interp(~ filter_var2 %in% k ,.values = list(filter_var2 = as.name(col2)))
filter_crit_i = interp(~ filter_var1 %in% i ,.values = list(filter_var1 = as.name(col1)))
select_variables.k<-c(col2,varname)
select_variables.i<-c(col1,varname)
results_ik<- data %>% filter_( filter_crit2 ) %>% select_( .dots=varname ) %>%rename_(.dots=setNames( varname, varname_ik))
results_ik<-cbind( results_ik ,d2)
results_i<- data %>% filter_( filter_crit_i ) %>% select_( .dots=select_variables.k) %>%rename_(.dots=setNames( varname, varname_i)) %>%inner_join( ., results_ik)
results_k<- data %>% filter_( filter_crit_k ) %>% select_( .dots=select_variables.i) %>%rename_(.dots=setNames( varname, varname_k)) %>% inner_join( ., results_i)
mutate_fn <- function(d_in, varval, varname_norm){
d_out = d_in %>%
mutate_(.dots = setNames( varval, varname_norm))
}
d_in=results_k
varname_norm<- paste("z", var, sep="_")
varname_norm <- paste(varname_norm,"norm", sep="_")
varval <- lazyeval::interp(~ var1* var2 / (var3 * var4),.values= list( var1=as.name(varname) , var2=as.name(varname_ik), var3=as.name(varname_i), var4=as.name(varname_k)))
data.norm = mutate_fn(d_in,varval,varname_norm )
n<-length(data.norm)
names(data.norm)[n]<-varname_norm
data.norm <-select_( data.norm,.dots=c(col1,col2,varname,varname_norm))
}
Upvotes: 0
Reputation: 37208
This is too long to fit as a comment. But I can cut down your Stata code from 13 lines to 5, and I hope thereby make what you are doing clearer, to people using Stata or R or both. I keep your code blocks and add comments.
* no need to use -egen- to create a single mean; use -summarize- results
summarize value_var if ID1=="15t16" & ID2=="AUS", meanonly
gen value_var_i_k = r(mean)
* evaluating ID1 == "15t16" yields 1 or 0; dividing by 0 yields missing values
* which are ignored, which is what you want
bys ID2: egen value_var_k = mean(value_var / (ID1=="15t16"))
* same device as previous block
bys ID1: egen value_var_i = mean(value_var / (ID2=="AUS"))
gen value_var_i_k_norm = value_var * value_var_i_k/(value_var_i * value_var_k)
In fact, we could cut down further, but this version is a little less clear:
bys ID2: egen value_var_k = mean(value_var / (ID1=="15t16"))
bys ID1: egen value_var_i = mean(value_var / (ID2=="AUS"))
summarize value_var if ID1=="15t16" & ID2=="AUS", meanonly
gen value_var_i_k_norm = value_var * r(mean)/(value_var_i * value_var_k)
For the division by zero device, see this article, section 10. An alternative way to ignore certain observations is
bys ID1: egen value_var_i = mean(cond(ID2=="AUS", value_var, .))
which is discussed in the same article, section 9.
Either way, a key detail is that the mean()
function of egen
ignores missing values.
Upvotes: 1
Reputation: 4907
So, if I understand you correctly based on your edit.
This is your "normalization" routine:
If (ID1 == "AUS" | ID2 == "15t16") {value_var == 1}
else {value_var unchanged}
That's easily accomplished via ifelse
(assuming input data.frame
named df
):
df$value_var <- ifelse(df$ID1 == "AUS" | df$ID2 == "15t16", 1, df$value_var)
Again, I'm loath to call this "normalization", but it does what you describe. Typically normalization, in a statistical context, means scaling--i.e. subtracting the mean and dividing by the SD. But you don't appear to be doing that here.
Upvotes: 2