How can you create Relative Frequency Sequence Plots with (Sampling) Weights and Groups using TraMineR?

Question

Fasang/Liao(2014) have presented Relative Frequency Sequence Plots (RFSP) as a smoothing method for sequence plots:

A RFSP orders a sequence object according to cluster quality measure or MDS dimension.
It splits the sequence object into k subgroups.
It calculates the medoid for each subgroup.
Only the medoid sequences are then drawn in a sequence index plot.
Box-plots for each subgroup are presented alongside the sequence index plots showing the dissimiliarity to the medoid in each subgroup.

To show the difference, you find a sequence index plot (sorted by first MDS dimension) and RFSP below:

The big advantage of RFSPs is a further reduction of information which avoids "overplotting" in sequence index plots with many sequences while presenting goodness of fit statistics about this reduction.

The original article from Fasang/Liao does not mention weights, but it produces RFSPs for two groups (East/West Germany) from the same dataset. The function seqplot.rf from the R-package TraMineRextras can produce RFSPs. But it does neither allow the use of weights or the differentiation for groups. As weights are quite common and one often needs to control for different groups in the sample (e.g. women/men, young/old, cluster from prior sequence analysis), I am trying to find a suitable way to implement weights and groups.

Here is a working example using code from seqplot.rf, weights and groups are not used yet:

library(TraMineR)
library(TraMineRextras)

# Define Sequence Object --------------------------------------------------
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school",
                   "training")
mvad.labels <- c("Employment", "Further Education", "Higher Education",
                 "Joblessness", "School", "Training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")

seqdata <- seqdef(mvad[, 17:86], alphabet = mvad.alphabet, 
                  states = mvad.scodes, labels = mvad.labels)
                  #weights = mvad$weight)



# Calculate distance and define settings ----------------------------------
diss <- seqdist(seqdata, method="HAM") # Use Hamming Distance as example 
k=100

sortv=NULL
use.hclust=FALSE
hclust_method="ward.D"
use.quantile=FALSE
yaxis=FALSE
main=NULL

# Code from seqplot.rf -----------------------------------------------------
message(" [>] Using k=", k, " frequency groups")

#Extract medoid, possibly weighted
gmedoid.index <- disscenter(diss, medoids.index="first")

gmedoid.dist <-diss[, gmedoid.index] #Extract distance to general medoid

##Vector where distance to k medoid will be stored
kmedoid.dist <- rep(0, nrow(seqdata))
#index of the k-medoid for each sequence
kmedoid.index <- rep(0, nrow(seqdata))
#calculate qij - distance to frequency group specific medoid within frequency group
if(is.null(sortv) && !use.hclust){
  sortv <- cmdscale(diss, k = 1)

}
if(!is.null(sortv)){
  ng <- nrow(seqdata) %/% k
  r <- nrow(seqdata) %% k
  n.per.group <- rep(ng, k)
  if(r>0){
    n.per.group[order(runif(r))] <- ng+1
  }
  mdsk <- rep(1:k, n.per.group)
  mdsk <- mdsk[rank(sortv, ties.method = "random")]
}else{
  hh <- hclust(as.dist(diss), method=hclust_method)
  mdsk <- factor(cutree(hh, k))
  medoids <- disscenter(diss, group=mdsk, medoids.index="first")
  medoids <- medoids[levels(mdsk)]
  #ww <- xtabs(~mdsk)
  mds <- cmdscale(diss[medoids, medoids], k=1)
  mdsk <- as.integer(factor(mdsk, levels=levels(mdsk)[order(mds)]))
}
kun <- length(unique(mdsk))
if(kun!=k){
  warning(" [>] k value was adjusted to ", kun)
  k <- kun
  mdsk <- as.integer(factor(mdsk, levels=sort(unique(mdsk))))
}
#sortmds.seqdata$mdsk<-c(rep(1:m, each=r+1),rep({m+1}:k, each=r))
##pmdse <- 1:k
#pmdse20<-1:20

##for each k
for(i in 1:k){
  ##Which individuals are in the k group
  ind <- which(mdsk==i)
  if(length(ind)==1){
    kmedoid.dist[ind] <- 0
    ##Index of the medoid sequence for each seq
    kmedoid.index[ind] <- ind
  }else{
    dd <- diss[ind, ind]
    ##Indentify medoid
    kmed <- disscenter(dd, medoids.index="first")
    ##Distance to medoid for each seq
    kmedoid.dist[ind] <- dd[, kmed]
    ##Index of the medoid sequence for each seq
    kmedoid.index[ind] <- ind[kmed]
  }
  ##Distance matrix for this group

}

##Attribute to each sequences the medoid sequences
seqtoplot <- seqdata[kmedoid.index, ]

##Correct weights to their original weights (otherwise we use the medoid weights)
attr(seqtoplot, "weights") <- NULL
opar <- par(mfrow=c(1,2), oma=c(3,0,(!is.null(main))*3,0), mar=c(1, 1, 2, 0))
on.exit(par(opar))
seqIplot(seqtoplot, withlegend=FALSE, sortv=mdsk, title="Sequences medoids")
##seqIplot(seqtoplot, withlegend=FALSE, sortv=mdsk)
heights <- xtabs(~mdsk)/nrow(seqdata)
at <- (cumsum(heights)-heights/2)/sum(heights)*length(heights)
if(!yaxis){
  par(yaxt="n")
}

boxplot(kmedoid.dist~mdsk, horizontal=TRUE, width=heights, frame=FALSE,  
        main="Dissimilarities to medoid", ylim=range(as.vector(diss)), at=at)

#calculate R2
R2 <-1-sum(kmedoid.dist^2)/sum(gmedoid.dist^2)
#om K=66 0.5823693


#calculate F
ESD <-R2/(k-1) # averaged explained variance
USD <-(1-R2)/(nrow(seqdata)-k) # averaged explained variance
Fstat <- ESD/USD

message(" [>] Pseudo/median-based-R2: ", format(R2))
message(" [>] Pseudo/median-based-F statistic: ", format(Fstat))
##cat(sprintf("Representation quality: R2=%0.2f F=%0.2f", R2, Fstat))
title(main=main, outer=TRUE)
title(sub=sprintf("Representation quality: R2=%0.2f and F=%0.2f", R2, Fstat), outer=TRUE, line=2)

Generally, I think it should be possible to implement weights and groups for RFSPs:

For frequency weights, there seems to be a rather straightforward approach: I could simply expand the number of cases in my dataset accordingly. However, this could lead to huge datasets and associated memory or speed issues. For sampling weights which are often decimals, this would not work.
Therefore, a more general approach would be helpful. The first step to produce a RFSP could be done with weights using wcmdscale from the vegan-package or weighted clustering measures provided by packages lige e.g. WeightedCluster. The second step, I think would be harder as there could be necessary splits "in cases" overblown by weights. For these instances, it would be necessary to allow that weighted cases could belong to more than one group. Steps 3 to 5 could be then followed as usual.

For groups, I think it should be possible to go through steps 1 to 5 for each group separately if one is not interested in comparing the groups to the general medoid. That would mean that, in case the distance measure is not sensitive to absent/present cases (by e.g. using transition based substitution costs for each group), it is possible to use the same, but differently subsetted distance matrix for all groups.

References
_{Fasang, Anette E. und Tim Futing Liao, 2014: Visualizing Sequences in the Social Sciences: Relative Frequency Sequence Plots, in: Sociological Methods & Research 43, S. 643-676.}

How can you create Relative Frequency Sequence Plots with (Sampling) Weights and Groups using TraMineR?

Answers (1)

Related Questions