Javiss
Javiss

Reputation: 805

How to pivot a complex tidy data frame using dplyr

I have a tidy dataset that has several numerical and categorical columns, for instance id, categorical_1, categorical_2, etc. Also, I got columns that follows a patter according to:

n_neighbors <- c(15,30,50,100,150)
min_distance <- c( 0.001, 0.003, 0.009,0.03,0.09)
metrics <- c("euclidean" ,"cosine","hamming")

enter image description here

In the example above, I just show 4 columns of the data frame. The name of the following columns refer to:

  1. nn_.15.md_.0.001.metric.euclidean.1
  1. nn_.15.md_.0.001.metric.euclidean.2

I would like to pivot the table in a way that there is new columns created: dimension_1, dimension_2, n_neighbors, min_distance, metric. If possible using deplyr and tidy dataframes

dimension_1 and dimension_2 should contain the values of the rows In the above example, while n_neighbors, min_distance, metric should contain the value written in the column name. The output that I aim for is:

row_1 -> dimension_1=0.5677311, dimension_2=-11.70898, n_neighbors=15, min_distance=0.001, metric = euclidean

row_2 -> dimension_1=0.5682809, dimension_2=-11.71056, n_neighbors=15, min_distance=0.001, metric = euclidean

row_3 -> dimension_1=0.5674967, dimension_2=-11.70665, n_neighbors=15, min_distance=0.001, metric = euclidean

row_4 -> dimension_1=0.5687164, dimension_2=-11.70848, n_neighbors=15, min_distance=0.001, metric = euclidean

row_5 -> dimension_1=-1.674230, dimension_2=16.693973, n_neighbors=15, min_distance=0.001, metric = cosine

..

..

row_8 -> dimension_1=-1.674726, dimension_2=16.69405, n_neighbors=15, min_distance=0.001, metric = cosine

Here you can find the code example for the subset of data:

structure(list(nn_.15.md_.0.001.metric.euclidean.1 = c(0.567731082439423, 
0.568280875682831, 0.567496657371521, 0.568716406822205), nn_.15.md_.0.001.metric.euclidean.2 = c(-11.7089824676514, 
-11.7105579376221, -11.7066516876221, -11.7084884643555), nn_.15.md_.0.001.metric.cosine.1 = c(-1.67423057556152, 
-1.67501986026764, -1.69689452648163, -1.67472624778748), nn_.15.md_.0.001.metric.cosine.2 = c(16.6939735412598, 
16.6941356658936, 16.7890815734863, 16.6940479278564))

Upvotes: 0

Views: 102

Answers (2)

Anoushiravan R
Anoushiravan R

Reputation: 21908

You can use the following solution:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(everything(), names_to = c(NA, "n_neighbour", NA, "min_distance", NA, ".value"), 
               names_pattern = "(nn_\\.)(\\d+)\\.(md_\\.)(\\d\\.\\d+)\\.(metric\\.)(.*)") %>%
  pivot_longer(!c(min_distance, n_neighbour), 
               names_to = c("metric", ".value"), names_pattern = "(.*)\\.(\\d)") %>%
  rename_with(~ paste("Dimension", .), c(4, 5)) %>%
  relocate(`Dimension 1`, `Dimension 2`) %>%
  arrange(desc(metric))

# A tibble: 8 x 5
  `Dimension 1` `Dimension 2` min_distance n_neighbour metric   
          <dbl>         <dbl> <chr>        <chr>       <chr>    
1         0.568         -11.7 0.001        15          euclidean
2         0.568         -11.7 0.001        15          euclidean
3         0.567         -11.7 0.001        15          euclidean
4         0.569         -11.7 0.001        15          euclidean
5        -1.67           16.7 0.001        15          cosine   
6        -1.68           16.7 0.001        15          cosine   
7        -1.70           16.8 0.001        15          cosine   
8        -1.67           16.7 0.001        15          cosine  

Upvotes: 4

Ronak Shah
Ronak Shah

Reputation: 388817

You can use pivot_longer to get the data in long format.

library(tidyverse)

df %>%
  pivot_longer(cols = everything(), 
               names_to = c('n_neighbors', 'min_distance', 'metric', 'row'), 
               names_pattern = 'nn_\\.(\\d+)\\.md_\\.(\\d+\\.\\d+)\\.metric\\.([a-z]+)\\.(\\d+)')

#   n_neighbors min_distance metric    row     value
#   <chr>       <chr>        <chr>     <chr>   <dbl>
# 1 15          0.001        euclidean 1       0.568
# 2 15          0.001        euclidean 2     -11.7  
# 3 15          0.001        cosine    1      -1.67 
# 4 15          0.001        cosine    2      16.7  
# 5 15          0.001        euclidean 1       0.568
# 6 15          0.001        euclidean 2     -11.7  
# 7 15          0.001        cosine    1      -1.68 
# 8 15          0.001        cosine    2      16.7  
# 9 15          0.001        euclidean 1       0.567
#10 15          0.001        euclidean 2     -11.7  
#11 15          0.001        cosine    1      -1.70 
#12 15          0.001        cosine    2      16.8  
#13 15          0.001        euclidean 1       0.569
#14 15          0.001        euclidean 2     -11.7  
#15 15          0.001        cosine    1      -1.67 
#16 15          0.001        cosine    2      16.7  

If you want data exactly as shown you can add a pivot_wider step.

df %>%
  pivot_longer(cols = everything(), 
               names_to = c('n_neighbors', 'min_distance', 'metric', 'row'), 
               names_pattern = 'nn_\\.(\\d+)\\.md_\\.(\\d+\\.\\d+)\\.metric\\.([a-z]+)\\.(\\d+)') %>%
  pivot_wider(names_from = row, values_from = value, names_prefix = 'dimension_', values_fn = list) %>%
  unnest(starts_with('dimension'))

#  n_neighbors min_distance metric    dimension_1 dimension_2
#  <chr>       <chr>        <chr>           <dbl>       <dbl>
#1 15          0.001        euclidean       0.568       -11.7
#2 15          0.001        euclidean       0.568       -11.7
#3 15          0.001        euclidean       0.567       -11.7
#4 15          0.001        euclidean       0.569       -11.7
#5 15          0.001        cosine         -1.67         16.7
#6 15          0.001        cosine         -1.68         16.7
#7 15          0.001        cosine         -1.70         16.8
#8 15          0.001        cosine         -1.67         16.7

Upvotes: 1

Related Questions