W Hampton

Reputation: 53

Apply factor levels from string version of df to numeric version of df

I have two dataframes. df1 holds numeric data. df2 holds the same set of observations as strings (exactly 1:1 correspondence between all cells). Each dataframe has many columns, with values like 1 to 5 in the numeric dataframe, but the meaning of these numbers differs between columns. In the example below, 3 means "A" in df1$X1 but "K" in df1$X2.

The data is too extensive to factor and label manually, so I want to use the string dataframe to label the values in the numeric dataframe.

x <- c(NA, "2", "2", "3", "3", "3")
z <- c(NA, "B", "J", "K", "A", "K")

z <- matrix(z, nrow = 3, ncol = 2, byrow = TRUE)
x <- matrix(x, nrow = 3, ncol = 2, byrow = TRUE)

df1 <- data.frame(x)
df2 <- data.frame(z)

df1[1,2]
df2[1,2]

This is how I would do it manually:

df1$X1 <- factor(df1$X1, levels=df1$X1, labels=df2$X1)
df1$X2 <- factor(df1$X2, levels=df1$X2, labels=df2$X2)

...

This was my attempt at a loop, which works if there are no NAs:

for (c in colnames(df1)) {
  df1[, c] <- factor(df1[, c], levels = df1[, c], labels = df2[, c])
}

However, as noted, the loop above doesn't work when there are NAs in the dataset. It gives an error:

Error in factor(df1[, c], levels = df1[, c], labels = df2[, c]) : 
  invalid 'labels'; length 3 should be 1 or 2

There are many NAs in the dataset because it's a multi-branch survey (some questions were answered only by certain groups, while others are common to all participants). I'd rather not go the na.omit route, because that would essentially mean creating a separate na.omit dataset for every analysis I need to do.

Upvotes: 2

Views: 44

Answers (1)

r2evans

Reputation: 160407

I think we can use lapply for this, and reassign back into df1.

df1[] <- lapply(names(df1),
                function(nm) factor(df1[[nm]], levels = df1[[nm]], labels = df2[[nm]]))
df1
#   X1 X2
# 1  A  B
# 2  J  K
# 3  A  K
str(df1)
# 'data.frame': 3 obs. of  2 variables:
#  $ X1: Factor w/ 2 levels "A","J": 1 2 1
#  $ X2: Factor w/ 2 levels "B","K": 1 2 2

(The brackets are needed in df1[] <- because lapply returns a list, not a data.frame. If we did df1 <-, we would replace the object pointed to by the df1 symbol with a new one, losing the frame-like properties. By using df1[] <-, we replace the contents of df1 with the return from lapply while keeping the frame attributes of df1. This works well because a data.frame is effectively a list with names and all elements being the same length ... that's a bit reductive, but sufficient I think for visualizing how this works.)
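If the columns contain NA, one possible variant is to build the code-to-label lookup from the non-missing pairs only. By default factor() drops NA from levels (exclude = NA), which is what produces the levels/labels length mismatch in the question's error. The following is a minimal sketch, not a tested solution for the full survey data: label_column is a hypothetical helper name, and it assumes each code maps to exactly one label within a column.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps columns as character on older R)
x <- matrix(c(NA, "2", "2", "3", "3", "3"), nrow = 3, ncol = 2, byrow = TRUE)
z <- matrix(c(NA, "B", "J", "K", "A", "K"), nrow = 3, ncol = 2, byrow = TRUE)
df1 <- data.frame(x, stringsAsFactors = FALSE)
df2 <- data.frame(z, stringsAsFactors = FALSE)

# Hypothetical helper: label one column of codes using the matching column of strings.
# Assumes a one-to-one mapping from code to label within the column.
label_column <- function(codes, labels) {
  keep <- !is.na(codes) & !is.na(labels)           # ignore rows where either value is missing
  pairs <- unique(data.frame(code = codes[keep],
                             label = labels[keep],
                             stringsAsFactors = FALSE))
  pairs <- pairs[order(pairs$code), ]              # order levels by code (character codes sort as text)
  factor(codes, levels = pairs$code, labels = pairs$label)
}

# Map() pairs up the columns of df1 and df2 by position; df1[] keeps the data.frame structure
df1[] <- Map(label_column, df1, df2)

# X1 is now NA, J, A and X2 is B, K, K; the NA rows stay NA instead of raising an error
df1
```

Because the lookup is rebuilt per column, the same code can carry a different label in each column, and rows that are NA in df1 simply remain NA in the resulting factor.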

Upvotes: 2
