Kuba_

Reputation: 814

Mutate within nested data frame

I would like to run kmeans within groups and add to my data the cluster number and the center that each observation was assigned to (still within groups, so cluster 1 for group A is not the same as cluster 1 for group B). My idea was to pluck the cluster assignments and the centroids from kmeans, join the two with each other, and finally join them with the original data. For the first join, I wanted to add a row number to the data frames of centers and then join by cluster number. But how can I add a row number within nested data frames? The following code works until the last, 'nested' mutate.

my_data <- data.frame(group = c(sample(c('A', 'B', 'C'), 20, replace = TRUE)), x = runif(100, 0, 10), y = runif(100, 0, 10))
my_data %>% 
  group_by(group) %>% 
  nest() %>% 
  mutate(km_cluster = map(data, ~kmeans(.x, 3) %>% pluck('cluster')),
         km_centers = map(data, ~kmeans(.x, 3) %>% pluck('centers') %>% mutate(cluster = row_number())))

@Luke.sonnet provided an answer that works well with map, but interestingly not with map2, see below:

my_data %>% 
  group_by(group) %>% 
  nest() %>% 
  mutate(number = sample(3:7, 3)) %>% 
  mutate(km_cluster = map2(data, number, ~kmeans(.x, .y) %>% pluck('cluster')), 
     km_centers = map2(data, number, ~kmeans(.x, .y) %>% pluck('centers') %>% as_tibble() %>% mutate(cluster = row_number())))

Any ideas how to solve the issue in that case? And equally important, what is the cause of such behaviour?

Upvotes: 1

Views: 2472

Answers (1)

luke.sonnet

Reputation: 475

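The problem is that pluck('centers') returns a matrix, and mutate() only works on data frames. Convert the matrix to a tibble first, then number the rows with seq_len(nrow(.)) instead of row_number().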

library(tidyverse)
my_data <- data.frame(group = c(sample(c('A', 'B', 'C'), 20, replace = TRUE)), x = runif(100, 0, 10), y = runif(100, 0, 10))
my_data %>% 
    group_by(group) %>% 
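    nest() %>% 
    ungroup() %>%  # recent tidyr versions keep the grouping after nest(); drop it so sample() can fill all rows at once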
    mutate(number = sample(3:7, 3)) %>% 
    mutate(km_cluster = map2(data, number, ~kmeans(.x, .y) %>% pluck('cluster')), 
           km_centers = map2(data, number, ~kmeans(.x, .y) %>% pluck('centers') %>% as_tibble() %>% mutate(cluster = seq_len(nrow(.)))))

Note that you can also use mutate(cluster = row_number(x)), but this gives different numbers, because row_number(x) ranks the rows by the value of x rather than by their position. (A bare row_number() appears to use the rows of the parent data frame instead.) Since kmeans returns the matrix of centers ordered row-wise by cluster number, I think the numbering in the main chunk is the correct one.
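
One more thing to watch for when joining assignments with centers: each call to kmeans with a number of centers starts from randomly chosen points, so the cluster labels from one call need not match the center rows from a second call. Below is a minimal sketch of the join described in the question, fitting once per group and reusing that fit for both pieces. The column names center_x and center_y, the fixed seed, and the choice of 3 clusters are assumptions made for illustration only.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)  # assumption: fixed seed so the random starts are reproducible
my_data <- data.frame(
  group = sample(c("A", "B", "C"), 100, replace = TRUE),
  x = runif(100, 0, 10),
  y = runif(100, 0, 10)
)

result <- my_data %>%
  group_by(group) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    # Fit once per group so assignments and centers come from the same run
    km = map(data, ~ kmeans(.x, centers = 3)),
    data = map2(data, km, function(d, fit) {
      # fit$centers is a matrix; convert it before using dplyr verbs
      centers <- as_tibble(fit$centers) %>%
        rename(center_x = x, center_y = y)
      # Rows of the centers matrix are ordered by cluster number
      centers$cluster <- seq_len(nrow(centers))
      d$cluster <- unname(fit$cluster)
      left_join(d, centers, by = "cluster")
    })
  ) %>%
  select(-km) %>%
  unnest(data)

head(result)
```

Each row of result keeps its original x and y and gains the within-group cluster number plus the coordinates of that cluster's center. To vary the number of clusters per group, as in the map2 version above, the same idea should carry over by passing the number column into the kmeans call.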

Upvotes: 2
