crestor
crestor

Reputation: 1476

pivot a dataframe without aggregation

The intention is to pivot a dataframe (that represents a one to many relation: one computer to many monitors) to a wider representation.

The dataframe (abbreviated) could be:

library(tidyverse)
df <- tibble::tribble(
  ~CPU_ID,    ~ID, ~CONFIGITEM_NUMBER,        ~NAME, ~AllocationDate,                   ~Model,           ~Vendor,
  182434, 195251,       101142000825, "COMP000572",    "2014-04-10", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
  182434, 405022,         1142027261, "COMP030500",    "2020-12-02",                  "V173A",            "ACER",
  182436, 183607,       101142000008, "COMP000008",    "2014-04-18", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
  182437, 228469,         1142006861, "COMP020117",    "2018-03-05",              "S22C45KBW",         "Samsung",
  182437, 341806,         1142019822, "COMP050244",    "2019-01-09",                 "L1940T",              "HP",
  182438, 205930,       101142001009, "COMP050002",    "2019-05-20",              "S22C45KBW",         "Samsung",
  182439, 240546,         1142008622, "COMP050131",    "2016-09-16", "SAMSUNG SYNCMASTER 943",         "SAMSUNG",
  182462, 184114,       101142000515, "COMP000515",    "2019-08-27", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
  182463, 184113,       101142000514, "COMP000514",    "2019-08-28", "HP ELITE DISPLAY E-231", "Hewlett-Packard",
  182464, 184106,       101142000507, "COMP000507",    "2019-08-27", "HP ELITE DISPLAY E-231", "Hewlett-Packard"
)

And I can pivot it correctly with:


df %>%
  group_by(CPU_ID) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  rename_with( ~ paste0("monitor1_", .), .cols = !CPU_ID) %>%
  left_join(
    df %>%
      group_by(CPU_ID) %>%
      filter(row_number() == 2) %>%
      ungroup() %>%
      rename_with( ~ paste0("monitor2_", .), .cols = !CPU_ID),
    by = "CPU_ID"
  )
#> # A tibble: 8 x 13
#>   CPU_ID monitor1_ID monitor1_CONFIG~ monitor1_NAME monitor1_Alloca~ monitor1_Model monitor1_Vendor
#>    <dbl>       <dbl>            <dbl> <chr>         <chr>            <chr>          <chr>
#> 1 182434      195251     101142000825 COMP000572    2014-04-10       HP ELITE DISP~ Hewlett-Packard
#> 2 182436      183607     101142000008 COMP000008    2014-04-18       HP ELITE DISP~ Hewlett-Packard
#> 3 182437      228469       1142006861 COMP020117    2018-03-05       S22C45KBW      Samsung
#> 4 182438      205930     101142001009 COMP050002    2019-05-20       S22C45KBW      Samsung
#> 5 182439      240546       1142008622 COMP050131    2016-09-16       SAMSUNG SYNCM~ SAMSUNG
#> 6 182462      184114     101142000515 COMP000515    2019-08-27       HP ELITE DISP~ Hewlett-Packard
#> 7 182463      184113     101142000514 COMP000514    2019-08-28       HP ELITE DISP~ Hewlett-Packard
#> 8 182464      184106     101142000507 COMP000507    2019-08-27       HP ELITE DISP~ Hewlett-Packard
#> # ... with 6 more variables: monitor2_ID <dbl>, monitor2_CONFIGITEM_NUMBER <dbl>,
#> #   monitor2_NAME <chr>, monitor2_AllocationDate <chr>, monitor2_Model <chr>, monitor2_Vendor <chr>

But in the real dataframe, there are cases with more than two monitors per computer, so this formula would need many left_join.

I was trying to write an alternative like:

df %>%
  group_by(CPU_ID) %>%
  mutate(monitor_n = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = CPU_ID,
    names_from = monitor_n,
    values_from = !CPU_ID
  ) %>%
  select(-starts_with("monitor_n")) %>%
  rename_with(function(colname)
    str_replace(colname, "^(.*)_(\\d)$", "monitor\\2_\\1"),
    .cols = !CPU_ID)
#> # A tibble: 8 x 13
#>   CPU_ID monitor1_ID monitor2_ID monitor1_CONFIG~ monitor2_CONFIG~ monitor1_NAME monitor2_NAME
#>    <dbl>       <dbl>       <dbl>            <dbl>            <dbl> <chr>         <chr>
#> 1 182434      195251      405022     101142000825       1142027261 COMP000572    COMP030500
#> 2 182436      183607          NA     101142000008               NA COMP000008    <NA>
#> 3 182437      228469      341806       1142006861       1142019822 COMP020117    COMP050244
#> 4 182438      205930          NA     101142001009               NA COMP050002    <NA>
#> 5 182439      240546          NA       1142008622               NA COMP050131    <NA>
#> 6 182462      184114          NA     101142000515               NA COMP000515    <NA>
#> 7 182463      184113          NA     101142000514               NA COMP000514    <NA>
#> 8 182464      184106          NA     101142000507               NA COMP000507    <NA>
#> # ... with 6 more variables: monitor1_AllocationDate <chr>, monitor2_AllocationDate <chr>,
#> #   monitor1_Model <chr>, monitor2_Model <chr>, monitor1_Vendor <chr>, monitor2_Vendor <chr>

But I need to mantain the columns in the same order as the original dataframe.

Could you suggest other easier (neater) alternatives?

Upvotes: 2

Views: 200

Answers (3)

crestor
crestor

Reputation: 1476

For the record, what I finally used was (based on Lenny's and Jon Spring's answers):

df %>%
  pivot_longer(
    cols = !CPU_ID,
    names_to = "variable",
    values_to = "value",
    values_transform = list(value = as.character)
  ) %>%
  group_by(CPU_ID, variable) %>%
  mutate(variable = paste0("monitor", row_number(), "_", variable)) %>%
  ungroup() %>%
  pivot_wider(names_from = variable, values_from = value)

Upvotes: 1

Jon Spring
Jon Spring

Reputation: 66775

Similar to @Lennyy's 2nd solution, I would suggest pivoting longer and then pivoting wider. A potential downside is you need to at least temporarily make them all the same type, e.g. character, but you could convert any of those back at the end if necessary.

df %>%
  pivot_longer(cols = -CPU_ID, names_to = "variable", values_to = "value",
               values_transform = list(value = as.character)) %>%
  group_by(CPU_ID, variable) %>%
  mutate(variable = paste(variable, row_number(), sep = "_")) %>%
  ungroup() %>%
  pivot_wider(names_from = variable, values_from = value)


# A tibble: 8 x 13
  CPU_ID ID_1   CONFIGITEM_NUMBER… NAME_1   AllocationDate_1 Model_1        Vendor_1    ID_2  CONFIGITEM_NUMBE… NAME_2  AllocationDate_2 Model_2 Vendor_2
   <dbl> <chr>  <chr>              <chr>    <chr>            <chr>          <chr>       <chr> <chr>             <chr>   <chr>            <chr>   <chr>   
1 182434 195251 101142000825       COMP000… 2014-04-10       HP ELITE DISP… Hewlett-Pa… 4050… 1142027261        COMP03… 2020-12-02       V173A   ACER    
2 182436 183607 101142000008       COMP000… 2014-04-18       HP ELITE DISP… Hewlett-Pa… NA    NA                NA      NA               NA      NA      
3 182437 228469 1142006861         COMP020… 2018-03-05       S22C45KBW      Samsung     3418… 1142019822        COMP05… 2019-01-09       L1940T  HP      
4 182438 205930 101142001009       COMP050… 2019-05-20       S22C45KBW      Samsung     NA    NA                NA      NA               NA      NA      
5 182439 240546 1142008622         COMP050… 2016-09-16       SAMSUNG SYNCM… SAMSUNG     NA    NA                NA      NA               NA      NA      
6 182462 184114 101142000515       COMP000… 2019-08-27       HP ELITE DISP… Hewlett-Pa… NA    NA                NA      NA               NA      NA      
7 182463 184113 101142000514       COMP000… 2019-08-28       HP ELITE DISP… Hewlett-Pa… NA    NA                NA      NA               NA      NA      
8 182464 184106 101142000507       COMP000… 2019-08-27       HP ELITE DISP… Hewlett-Pa… NA    NA                NA      NA               NA      NA 

Upvotes: 3

Lennyy
Lennyy

Reputation: 6132

Perhaps something along these lines?

df %>% 
  group_by(CPU_ID) %>% 
  mutate(rowno = row_number()) %>% 
  ungroup %>% 
  gather(var, val, -CPU_ID, -rowno) %>% 
  mutate(newcolname = paste0("monitor", rowno, "_", var)) %>% 
  select(-c(var, rowno)) %>% 
  pivot_wider(names_from = newcolname, values_from = val)

# A tibble: 8 x 13
  CPU_ID monitor1_ID monitor2_ID monitor1_CONFIG~ monitor2_CONFIG~ monitor1_NAME monitor2_NAME monitor1_Alloca~ monitor2_Alloca~ monitor1_Model
   <dbl> <chr>       <chr>       <chr>            <chr>            <chr>         <chr>         <chr>            <chr>            <chr>         
1 182434 195251      405022      101142000825     1142027261       COMP000572    COMP030500    2014-04-10       2020-12-02       HP ELITE DISP~
2 182436 183607      NA          101142000008     NA               COMP000008    NA            2014-04-18       NA               HP ELITE DISP~
3 182437 228469      341806      1142006861       1142019822       COMP020117    COMP050244    2018-03-05       2019-01-09       S22C45KBW     
4 182438 205930      NA          101142001009     NA               COMP050002    NA            2019-05-20       NA               S22C45KBW     
5 182439 240546      NA          1142008622       NA               COMP050131    NA            2016-09-16       NA               SAMSUNG SYNCM~
6 182462 184114      NA          101142000515     NA               COMP000515    NA            2019-08-27       NA               HP ELITE DISP~
7 182463 184113      NA          101142000514     NA               COMP000514    NA            2019-08-28       NA               HP ELITE DISP~
8 182464 184106      NA          101142000507     NA               COMP000507    NA            2019-08-27       NA               HP ELITE DISP~
# ... with 3 more variables: monitor2_Model <chr>, monitor1_Vendor <chr>, monitor2_Vendor <chr>

Could use pivot_longer as well, but it changes the order of columns (which could be corrected if needed):

df %>% 
  group_by(CPU_ID) %>% 
  mutate(rowno = row_number()) %>% 
  ungroup %>% 
  pivot_longer(-c(CPU_ID, rowno), names_to = "var", values_to = "val", values_transform = list(val = as.character)) %>% 
  mutate(newcolname = paste0("monitor", rowno, "_", var)) %>% 
  select(-c(var, rowno)) %>% 
  pivot_wider(names_from = newcolname, values_from = val)

# A tibble: 8 x 13
  CPU_ID monitor1_ID monitor1_CONFIG~ monitor1_NAME monitor1_Alloca~ monitor1_Model monitor1_Vendor monitor2_ID monitor2_CONFIG~ monitor2_NAME
   <dbl> <chr>       <chr>            <chr>         <chr>            <chr>          <chr>           <chr>       <chr>            <chr>        
1 182434 195251      101142000825     COMP000572    2014-04-10       HP ELITE DISP~ Hewlett-Packard 405022      1142027261       COMP030500   
2 182436 183607      101142000008     COMP000008    2014-04-18       HP ELITE DISP~ Hewlett-Packard NA          NA               NA           
3 182437 228469      1142006861       COMP020117    2018-03-05       S22C45KBW      Samsung         341806      1142019822       COMP050244   
4 182438 205930      101142001009     COMP050002    2019-05-20       S22C45KBW      Samsung         NA          NA               NA           
5 182439 240546      1142008622       COMP050131    2016-09-16       SAMSUNG SYNCM~ SAMSUNG         NA          NA               NA           
6 182462 184114      101142000515     COMP000515    2019-08-27       HP ELITE DISP~ Hewlett-Packard NA          NA               NA           
7 182463 184113      101142000514     COMP000514    2019-08-28       HP ELITE DISP~ Hewlett-Packard NA          NA               NA           
8 182464 184106      101142000507     COMP000507    2019-08-27       HP ELITE DISP~ Hewlett-Packard NA          NA               NA           
# ... with 3 more variables: monitor2_AllocationDate <chr>, monitor2_Model <chr>, monitor2_Vendor <chr>

Upvotes: 1

Related Questions