JPMD

Reputation: 664

Duplicating rows in R merge function

I am trying to run this merge function in R:

nomes <- '0'
dt <- data.frame("variable" = Level12R$level1.2_are.out$parameters$stdyx.standardized[,2])
for(i in 1:length(Level12R)) {
  nomes[i] = names(Level12R)[i]
  df = eval(parse(text=paste0("Level12R$",nomes[i],"$parameters$stdyx.standardized", collapse=NULL)))
  df <- df[,2:3]
  names(df)<-c("variable",toupper(substr(nomes[i],10,12)))
  dt <- merge(x=dt, y=df, by = "variable", all.x = TRUE)
}

on the data in Level12R.rds. The merge works, but I get duplicated rows: the loop runs from 1 to 48, and after a few iterations my dt object has millions of observations.

Any idea what is wrong here? (I also tried dplyr::left_join and got the same behavior.)

Thanks in advance for any comments.

João

Upvotes: 1

Views: 5822

Answers (3)

JPMD

Reputation: 664

Thanks guys! Yes, the problem was with the duplicate names in the df. So, my solution was:

nomes<-'0'
dt<- data.frame("variable" = paste0(Level12R$level1.2_are.out$parameters$stdyx.standardized[,1],
                                Level12R$level1.2_are.out$parameters$stdyx.standardized[,2],
                                Level12R$level1.2_are.out$parameters$stdyx.standardized[,8]))

for (i in 1:length(Level12R)){
    nomes[i] = names(Level12R)[i]
    df = eval(parse(text=paste0("Level12R$",nomes[i],"$parameters$stdyx.standardized", collapse=NULL)))
    df<-df[,2:3]
    names(df)<-c("variable",toupper(substr(nomes[i],10,12)))
    n1 = eval(parse(text=paste0("Level12R$",nomes[i],"$parameters$stdyx.standardized","[1]", collapse=NULL)))
    n2 = eval(parse(text=paste0("Level12R$",nomes[i],"$parameters$stdyx.standardized","[2]", collapse=NULL)))
    n3 = eval(parse(text=paste0("Level12R$",nomes[i],"$parameters$stdyx.standardized","[8]", collapse=NULL)))
    varnames <- paste0(n1[,],n2[,],n3[,])
    df[,1]<-varnames
    dt<-merge(x=dt, y=df, by = "variable", all.x = TRUE)
   }

 ## So now a nice summary per variable
 library(skimr)  # provides skim()
 dt2 <- data.frame(t(dt[-1]))
 colnames(dt2) <- dt[, 1]
 skim(dt2)
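As a side note, the eval(parse(text = ...)) calls in the loop can usually be replaced by `[[` indexing with the element name. A minimal sketch, assuming a nested list shaped like Level12R; the `toy` object and its values are made up for illustration:

```r
# Hypothetical nested list mimicking the shape of Level12R (values invented).
toy <- list(
  level1.2_are.out = list(parameters = list(
    stdyx.standardized = data.frame(param = c("A", "B"), est = c(0.1, 0.2)))),
  level1.2_aus.out = list(parameters = list(
    stdyx.standardized = data.frame(param = c("A", "B"), est = c(0.3, 0.4))))
)

for (nm in names(toy)) {
  # Build the expression as a string and evaluate it (as in the loop above).
  via_parse <- eval(parse(text = paste0("toy$", nm, "$parameters$stdyx.standardized")))
  # Index the list by name directly: same object, no string building.
  via_bracket <- toy[[nm]]$parameters$stdyx.standardized
  stopifnot(identical(via_parse, via_bracket))
}

# The country code is still taken from characters 10 to 12 of the name.
code <- toupper(substr(names(toy)[1], 10, 12))
```

Indexing by name avoids building code as text, which tends to be easier to read and debug.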

Upvotes: 0

moodymudskipper

Reputation: 47300

You have to take more columns from your source table to make sure your observations are uniquely identified.

We first build all these data.frames and then join them using Reduce:

dfs <- Map(Level12R, names(Level12R), f = function(x,y){
  df <- x$parameters$stdyx.standardized[c(1:3,8)]
  setNames(df,c("paramHeader","variable",toupper(substr(y,10,12)),"BetweenWithin"))
})

dt <- Reduce(function(x,y) merge(x,y,all.x=TRUE),dfs)

dt[1:5,1:8]
#   paramHeader variable BetweenWithin    ARE    AUS    AUT    AZE    BGR
# 1   ASRREA.ON  ACBG03A       Between -0.060 -0.140 -0.369  0.034 -0.002
# 2   ASRREA.ON ACBG12BD       Between -0.041 -0.108 -0.003 -0.029  0.159
# 3   ASRREA.ON  ACBG13H       Between -0.121 -0.143 -0.106 -0.112 -0.011
# 4   ASRREA.ON  ACBGDAS       Between  0.143  0.112  0.009 -0.053 -0.086
# 5   ASRREA.ON  ACBGEAS       Between  0.031  0.088  0.116 -0.336  0.359
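To see the Reduce pattern in isolation, here is a small sketch on a toy list. The three data frames are stand-ins for the per-country tables; the Intercepts values are invented for illustration. When no `by` is given, merge() joins on all column names the two inputs share, so the key columns line up on their own:

```r
# Hypothetical toy list standing in for the per-country tables.
dfs_toy <- list(
  data.frame(paramHeader = c("ASRREA.ON", "Intercepts"),
             variable    = c("ACBG03A", "ASRREA"),
             ARE         = c(-0.060, 0.5)),
  data.frame(paramHeader = c("ASRREA.ON", "Intercepts"),
             variable    = c("ACBG03A", "ASRREA"),
             AUS         = c(-0.140, 0.4)),
  # This table lacks the Intercepts row, so AUT will be NA there.
  data.frame(paramHeader = "ASRREA.ON",
             variable    = "ACBG03A",
             AUT         = -0.369)
)

# Fold the list left to right; each step joins on the shared key columns
# (paramHeader and variable) and keeps every row of the accumulated table.
dt_toy <- Reduce(function(x, y) merge(x, y, all.x = TRUE), dfs_toy)
```

Because the key columns uniquely identify each row, the result keeps two rows no matter how many tables are folded in.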

Upvotes: 1

phiver

Reputation: 23608

The problem is that the values in your variable column are not unique, so every merge adds more and more rows. Have a look at what you get when you do:

dt <- data.frame(Level12R$level1.2_are.out$parameters$stdyx.standardized[,1:2])

tail(dt)
            paramHeader          param
30            ASRREA.ON       ATBR10CG
31            ASRREA.ON       ATBR10DG
32            ASRREA.ON       ATBR10FG
33            ASRREA.ON       ATBR12AG
34           Intercepts         ASRREA
35   Residual.Variances         ASRREA

You can see that the last two variables are the same, but come from different headers.
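The row growth can be reproduced on a tiny example. This is a sketch with made-up estimates, not the real data: a key that appears twice on both sides matches 2 x 2 = 4 times, and each further merge doubles those rows again.

```r
# Toy data (hypothetical values): "ASRREA" appears twice in the key column.
left  <- data.frame(variable = c("ATBR12AG", "ASRREA", "ASRREA"))
right <- data.frame(variable = c("ATBR12AG", "ASRREA", "ASRREA"),
                    ARE      = c(0.1, 0.2, 0.3))

# First merge: 1 row for ATBR12AG plus 2 * 2 rows for ASRREA.
step1 <- merge(left, right, by = "variable", all.x = TRUE)
nrow(step1)  # 5

# Second merge against a table with the same duplicated key:
# 1 row for ATBR12AG plus 4 * 2 rows for ASRREA.
names(right)[2] <- "AUS"
step2 <- merge(step1, right, by = "variable", all.x = TRUE)
nrow(step2)  # 9
```

Repeated over dozens of tables, that doubling is enough to reach millions of rows.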

So we have to extend the join key so that each record is unique. Looking at the data, that takes three columns: 1, 2 and 8 ("header", "variable" and "betweenwithin"). Then we can loop through everything without getting duplicate records. Your dt object ends up with 35 records and 51 variables, with NAs where a result had 34 or even 25 records instead of 35.

library(dplyr)  # provides left_join()

nomes <- '0'
dt <- data.frame(Level12R$level1.2_are.out$parameters$stdyx.standardized[,c(1:2, 8)])
names(dt)<-c("header", "variable", "betweenwithin")
for(i in 1:length(Level12R)) {
  nomes[i] = names(Level12R)[i]
  df = eval(parse(text=paste0("Level12R$",nomes[i],"$parameters$stdyx.standardized", collapse=NULL)))
  df <- df[,c(1:3, 8)]
  names(df)<-c("header", "variable", toupper(substr(nomes[i],10,12)), "betweenwithin")
  dt <- left_join(x=dt, y=df, by = c("header", "variable", "betweenwithin"))
}

Normally I would use a list object in a loop, and later on see what I need to do with the data in the list. It prevents creating unintended side effects when using joins / merges etc.
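One check that fits this workflow is to test whether the intended key columns identify each row uniquely before joining anything. A minimal sketch; `has_unique_key` is a hypothetical helper name and the toy table is made up to mirror the tail of the output above:

```r
# Hypothetical helper: TRUE when the key columns identify each row uniquely.
has_unique_key <- function(df, keys) {
  # anyDuplicated() returns 0 when no row is repeated.
  anyDuplicated(df[, keys, drop = FALSE]) == 0
}

# Toy table mirroring the last rows shown earlier.
toy <- data.frame(
  header   = c("ASRREA.ON", "Intercepts", "Residual.Variances"),
  variable = c("ATBR12AG", "ASRREA", "ASRREA")
)

key_one <- has_unique_key(toy, "variable")               # FALSE: ASRREA repeats
key_two <- has_unique_key(toy, c("header", "variable"))  # TRUE
```

With the tables collected in a list, the same helper could be applied to every element (for example with vapply) before any merge is attempted.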

Upvotes: 2
