CBechet

Reputation: 171

Multiple Correspondence Analysis on longitudinal data

I would like to explore the profile of two modalities of a categorical variable over time with respect to a given set of other categorical variables. I paste a reproducible example of such a dataset below.

set.seed(90114)
V1<-sample(rep(c("a", "A"), 100))
V2<-sample(rep(c("a", "A", "b", "B"), 50))
V3<-sample(rep(c("F", "M", "I"), 67), 200)
V4<-sample(rep(c("C", "R"), 100))
V5<-sample(rep(c(1970, 1980, 1990, 2000, 2010), 40))
# stringsAsFactors=TRUE keeps V1-V4 as factors (the default before R 4.0)
data<-data.frame(V1, V2, V3, V4, V5, stringsAsFactors=TRUE)

To explore the behavior of these modalities, I decided to use Multiple Correspondence Analysis (package FactoMineR). To account for variation over time, one possibility is to split the dataset into 5 subsamples, one per level of V5, and then run MCA on each subset. The rest of the analysis consists of comparing the positions of the modalities across the resulting biplots. However, this practice is problematic when the original dataset is small: the dimensions may be flipped or, worse, the locations of the active variables are likely to change from one plot to the next.

To avoid the problem, one solution could be to stabilize the position of the active variables across all the subsets and predict the coordinates of the supplementary variable afterwards, allowing the latter to move over time. I read somewhere that the coordinates of a modality can be obtained by computing the weighted mean of the coordinates of the individuals in which this modality is found. So finding the coordinates of a modality for the year 1970 would boil down to computing the weighted mean of the coordinates of the individuals in the 1970 subset that carry that modality. However, I don't know whether this is common practice and, if it is, how to implement such calculations. I paste the rest of the code below so that you can visualize the problem.

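library(FactoMineR)

# MCA on V2-V4 (V5 dropped), with V1 as a supplementary qualitative variable
data.mca<-MCA(data[, -5], quali.sup=1, graph=FALSE)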

# Retrieve the coordinates of the first and second dimension

DIM1<-data.mca$ind$coord[, 1]
DIM2<-data.mca$ind$coord[, 2]

# Append the coordinates to the original dataframe

data1<-data.frame(data, DIM1, DIM2)

# Split the data into 5 clusters according to V5 ("year")

data1.split<-split(data1, data1$V5)
data1.split<-lapply(data1.split, function(x) x[, -5]) # to remove the fifth column with the years, no longer needed
# as.data.frame() on a one-element list prefixes the column names, e.g. X1970.V1
seventies<-as.data.frame(data1.split[1])
eighties<-as.data.frame(data1.split[2])
# ...

a.1970<-seventies[seventies$X1970.V1=="a",]
A.1970<-seventies[seventies$X1970.V1=="A",]

# The idea, then, is to find the coordinates of the modalities "a" and "A" by computing the weighted mean of their respective individuals for each subset. The arithmetic mean would yield

# a.1970.DIM1<-mean(a.1970$X1970.DIM1) # 0.0818
# a.1970.DIM2<-mean(a.1970$X1970.DIM2) # 0.1104

# and so on for the other levels of V5.

I thank you in advance for your help!

Upvotes: 1

Views: 549

Answers (1)

CBechet

Reputation: 171

I found a solution to my problem. We can simply weight the mean of the coordinates of the individuals by the row weights that FactoMineR stores in data.mca$call$row.w. To account for the dilatation of the MCA, the resulting barycentre coordinates should then be divided by the square root of the eigenvalue of the corresponding dimension.
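In symbols, the statement above can be sketched as follows. The notation is my own assumption and is not taken from the FactoMineR documentation: F_s(i) is the coordinate of individual i on dimension s, w_i its row weight, I_k the set of individuals carrying category k, and lambda_s the eigenvalue of dimension s.

```latex
\[
G_s(k) \;=\; \frac{1}{\sqrt{\lambda_s}} \cdot
\frac{\sum_{i \in I_k} w_i \, F_s(i)}{\sum_{i \in I_k} w_i}
\]
```

With unit weights the denominator is simply the number of individuals in category k, which is 100 for each category of V1 in this example.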

DIM1<-data.mca$ind$coord[, 1]
DIM2<-data.mca$ind$coord[, 2]
WEIGHT<-data.mca$call$row.w
data1<-data.frame(data, WEIGHT, DIM1, DIM2)

# Splitting the dataset according to values of V1

v1_a<-data1[data1$V1=="a",]
v1_A<-data1[data1$V1=="A",]

# Computing the weighted average of the coordinates of Dim1 and Dim2 for the first category of V1

# Dividing by the sum of the weights (100 in this example) rather than a hard-coded count
v1_a_Dim1<-sum(v1_a$WEIGHT*v1_a$DIM1)/sum(v1_a$WEIGHT) # -0.0248
v1_a_Dim2<-sum(v1_a$WEIGHT*v1_a$DIM2)/sum(v1_a$WEIGHT) # -0.0382

# Account for the dilatation of the dimensions...

v1_a_Dim1/sqrt(data.mca$eig[1,1])
# [1] -0.03923839
v1_a_Dim2/sqrt(data.mca$eig[2,1])
# [1] -0.06338353

# ... which is the same as the following:

categories<-data.mca$quali.sup$coord[, 1:2]
categories
#            Dim 1       Dim 2
# V1_a -0.03923839 -0.06338353
# V1_A  0.03923839  0.06338353

This can be applied to different partitions of the data according to V5 or any other categorical variable.
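A minimal sketch of that generalisation is given below. It assumes a hypothetical helper, barycentres(), which is not part of FactoMineR, and it builds the V1-by-V5 grouping with interaction(). The helper is first checked on a small toy input (made-up numbers, for illustration only); the FactoMineR part runs only if the package is installed.

```r
# Hypothetical helper (not a FactoMineR function): weighted barycentre of the
# individuals' coordinates within each group, divided by sqrt(eigenvalue).
barycentres <- function(coord, w, group, eig) {
  coord <- as.matrix(coord)
  group <- droplevels(as.factor(group))
  rows_by_group <- split(seq_len(nrow(coord)), group)
  bary <- do.call(rbind, lapply(rows_by_group, function(idx) {
    # weighted mean of every dimension for the rows of this group
    colSums(coord[idx, , drop = FALSE] * w[idx]) / sum(w[idx])
  }))
  # rescale each dimension (column) by the square root of its eigenvalue
  sweep(bary, 2, sqrt(eig), "/")
}

# Toy input, made up for illustration only
toy <- barycentres(
  coord = cbind(Dim1 = c(1, 3, -2, -2), Dim2 = c(0, 2, 1, 3)),
  w     = c(1, 3, 1, 1),
  group = c("a", "a", "A", "A"),
  eig   = c(4, 1)
)
print(toy)

# Application to the example data: one barycentre per V1 category and year
if (requireNamespace("FactoMineR", quietly = TRUE)) {
  set.seed(90114)
  data <- data.frame(
    V1 = sample(rep(c("a", "A"), 100)),
    V2 = sample(rep(c("a", "A", "b", "B"), 50)),
    V3 = sample(rep(c("F", "M", "I"), 67), 200),
    V4 = sample(rep(c("C", "R"), 100)),
    V5 = sample(rep(c(1970, 1980, 1990, 2000, 2010), 40)),
    stringsAsFactors = TRUE
  )
  data.mca <- FactoMineR::MCA(data[, -5], quali.sup = 1, graph = FALSE)
  by_year <- barycentres(
    coord = data.mca$ind$coord[, 1:2],
    w     = data.mca$call$row.w,
    group = interaction(data$V1, data$V5, drop = TRUE),
    eig   = data.mca$eig[1:2, 1]
  )
  print(by_year)  # rows are named like "a.1970", "A.1970", ...
}
```

Calling the helper with group = data$V1 instead should give the same values as data.mca$quali.sup$coord, which is a convenient check before moving on to the per-year partition.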

Upvotes: 1
