Reputation: 139
I have a dataframe of more than 5000 observations. I analysed it with hierarchical clustering and cut the tree into 8 clusters, some of which contain several hundred or around a thousand observations.
# Cut tree into 8 groups
cutree_hclust <- cutree(hclust.unsupervised, k = 8)
# Number of members in each cluster
table(cutree_hclust)
This gives the size of each cluster:
cutree_hclust
1 2 3 4 5 6 7 8
867 61 14 310 1135 432 119 5
To see what combination of variables each observation has in the different clusters, I thought it might be an idea to turn the 8 clusters into separate dataframes, so I can analyse them one by one. This is because I have no idea which rows ended up in which cluster, and therefore don't know what the pattern in the overall dataframe (Complete_df) is.
However, how can I make these new dataframes?
I can see what I assume to be the rows in the different clusters with, for example:
rownames(MY_df)[cutree_hclust == 7]
[1] "60" "72" "92" "97" "110" "210" "211" "267"
[9] "565"
But if I type:
h_clust <- as.data.frame(rownames(MY_df)[cutree_hclust == 7])
I only get a single column listing which rows are in this cluster; none of the other columns are included.
How can I select these specific rows in my dataframe called Complete_df, so that I can see what the overall variable combination is for each cluster?
I have tried the following:
rn <- rownames(MY_df)[cutree_hclust == 7]; subset(Complete_df, rn %in% rownames(MY_df))
(this is taken from: R how to select several rows to make a new dataframe)
and
Clust_7 <- rownames(MY_df)[cutree_hclust == 7]
Clust_7_df <- data.frame(matrix(unlist(Clust_7), nrow=9, byrow=T))
The above attempts did not work.
I look forward to hearing back from anyone who can help - as I have not been able to figure this out for myself :-)
Upvotes: 2
Views: 661
Reputation: 9865
I'll use the mtcars data frame as an example.
df <- mtcars
Now cluster the df:
hclust.unsupervised <- hclust(dist(df))
Then cut the tree into k = 8 groups:
cutree_hclust <- cutree(hclust.unsupervised, k = 8)
str(cutree_hclust)
shows that it is an integer vector with the cluster number assigned to the name of each row in df.
So the simplest approach is to add this vector as an additional column to your original data frame:
df$cluster <- cutree_hclust
Now you can split this original data frame into a list of sub-data frames by the value of the df$cluster column:
df.list <- split(df, df$cluster)
I think this list contains one sub-data frame per cluster, which is what you wanted:
$`1`
mpg cyl disp hp drat wt qsec vs am gear carb cluster
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 1
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 1
$`2`
mpg cyl disp hp drat wt qsec vs am gear carb cluster
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2
$`3`
mpg cyl disp hp drat wt qsec vs am gear carb cluster
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3
Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 3
Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 3
Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2 3
Ford Pantera L 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4 3
$`4`
mpg cyl disp hp drat wt qsec vs am gear carb cluster
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 4
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 4
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 4
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 4
$`5`
mpg cyl disp hp drat wt qsec vs am gear carb cluster
Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 5
Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 5
Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 5
$`6`
mpg cyl disp hp drat wt qsec vs am gear carb cluster
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 6
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 6
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 6
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 6
$`7`
mpg cyl disp hp drat wt qsec vs am gear carb cluster
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 7
$`8`
mpg cyl disp hp drat wt qsec vs am gear carb cluster
Maserati Bora 15 8 301 335 3.54 3.57 14.6 0 1 5 8 8
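If the goal is to see the typical variable combination in each cluster rather than every row, a per-cluster summary may be easier to read than the full list. The following is only a sketch on mtcars; the names cluster_sizes and cluster_means are made up for illustration, and the mean is just one possible summary.

```r
# Sketch on mtcars; cluster_sizes and cluster_means are illustrative names.
df <- mtcars

# dist(df) is evaluated before the cluster column is added,
# so the cluster labels themselves are not part of the distance.
df$cluster <- cutree(hclust(dist(df)), k = 8)

# Number of observations per cluster
cluster_sizes <- table(df$cluster)

# Mean of every variable within each cluster (one row per cluster)
cluster_means <- aggregate(. ~ cluster, data = df, FUN = mean)

print(cluster_sizes)
print(cluster_means)
```

Swapping FUN = mean for median, or calling summary() on each element of the split list, would work the same way.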
Since the elements of df.list are named after the cluster numbers and stored in that order, you can get the data frame for cluster 3, for example, by calling
df.list[[3]]
which gives:
mpg cyl disp hp drat wt qsec vs am gear carb cluster
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3
Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 3
Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 3
Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2 3
Ford Pantera L 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4 3
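For the setup in the question, where the clustering was run on MY_df but the full set of columns lives in Complete_df, the same idea should work by matching on row names. This is a sketch under the assumption that both data frames carry the same meaningful row names; here mtcars stands in for Complete_df and a four-column subset of it stands in for MY_df.

```r
# Assumption: MY_df and Complete_df share row names.
# mtcars and its subset are stand-ins for the asker's data.
Complete_df <- mtcars
MY_df <- mtcars[, c("mpg", "disp", "hp", "wt")]

hclust.unsupervised <- hclust(dist(MY_df))
cutree_hclust <- cutree(hclust.unsupervised, k = 8)

# Row names of the observations assigned to cluster 7
rn <- rownames(MY_df)[cutree_hclust == 7]

# Index Complete_df by those row names; all columns are kept
Clust_7_df <- Complete_df[rn, , drop = FALSE]

# All clusters at once: look up each row's cluster by its row name
Complete_df$cluster <- cutree_hclust[rownames(Complete_df)]
cluster_list <- split(Complete_df, Complete_df$cluster)
```

The subset() attempt in the question tests rn against rownames(MY_df), which is always TRUE; testing the other way round, subset(Complete_df, rownames(Complete_df) %in% rn), should select the same rows as Clust_7_df.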
Upvotes: 1
Reputation: 6222
I think your aim is to separate clusters into individual data.frames. If so, the following could help.
It uses the USArrests data. Only a part of the final output is shown.
hc <- hclust(dist(USArrests), "ave")
k <- 8
cutree_hclust <- cutree(hc, k = k)
df_list <- lapply(1 : k, function(x) USArrests[which(cutree_hclust == x), ])
df_list
# [[1]]
# Murder Assault UrbanPop Rape
# Alabama 13.2 236 58 21.2
# Alaska 10.0 263 48 44.5
# Delaware 5.9 238 72 15.8
# Illinois 10.4 249 83 24.0
# Louisiana 15.4 249 66 22.2
# Michigan 12.1 255 74 35.1
# Mississippi 16.1 259 44 17.1
# Nevada 12.2 252 81 46.0
# New York 11.1 254 86 26.1
# South Carolina 14.4 279 48 22.5
#
# [[2]]
# Murder Assault UrbanPop Rape
# Arizona 8.1 294 80 31.0
# California 9.0 276 91 40.6
# Maryland 11.3 300 67 27.8
# New Mexico 11.4 285 70 32.1
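As a side note, the list built with lapply() above is unnamed, so clusters can only be picked by position. A sketch of an alternative, assuming you would rather look clusters up by label, is to pass the cutree() vector straight to split(); the name df_list_named is made up here.

```r
hc <- hclust(dist(USArrests), "ave")
k <- 8
cutree_hclust <- cutree(hc, k = k)

# The lapply() version from above: an unnamed list of length k
df_list <- lapply(1:k, function(x) USArrests[which(cutree_hclust == x), ])

# split() produces the same pieces, named "1" to "8"
df_list_named <- split(USArrests, cutree_hclust)

# Access a cluster by its label instead of its position
cluster_2 <- df_list_named[["2"]]
```

Both versions leave USArrests itself unchanged, which can be handy if you do not want an extra cluster column in the data.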
Upvotes: 3