Ben G
Ben G

Reputation: 4338

Concatenate two columns of a dataframe in R if all values between them are NA?

I have a data frame that looks like this:

> sample
# A tibble: 6 x 10
  Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 Level_7 Level_8 Level_9 Supplier 
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <lgl>   <chr>    
1       1       2       3       4       8      NA      NA      NA NA      orioles  
2       1       2       3       4       9      13      NA      NA NA      nationals
3       1       2       3       5      10      14      16      18 NA      dodgers  
4       1       2       3       5      10      14      17      19 NA      cardinals
5       1       2       3       6      11      NA      NA      NA NA      giants   
6       1       2       3       7      12      15      NA      NA NA      padres  

What I'd like to do is concatenate the Supplier column with any Level column if all the values between them are NA. Another way I was thinking about this was that if the column to the right of the Level column is NA then to concatenate that column with the supplier column.

I was thinking a for loop but I haven't figured out how to implement the logic. The logic I was thinking is something like:

for (level in levels) {
   if is.na(level n + 1) {
     paste0(level, Supplier)
     }
   else {
    level}
   }

I could also do a bunch of mutate calls like this but it seems super repetitive and unnecessary:

sample %>%
  mutate(
    Level_5 = ifelse(
      is.na(Level_6),
      paste0(Supplier, "<br>", Level_5),
      Level_5)
  )

Here's the dput of the data:

structure(list(Level_1 = c(1, 1, 1, 1, 1, 1), Level_2 = c(2, 
2, 2, 2, 2, 2), Level_3 = c(3, 3, 3, 3, 3, 3), Level_4 = c(4, 
4, 5, 5, 6, 7), Level_5 = c(8, 9, 10, 10, 11, 12), Level_6 = c(NA, 
13, 14, 14, NA, 15), Level_7 = c(NA, NA, 16, 17, NA, NA), Level_8 = c(NA, 
NA, 18, 19, NA, NA), Level_9 = c(NA, NA, NA, NA, NA, NA), Supplier = c("orioles", 
"nationals", "dodgers", "cardinals", "giants", "padres")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 4

Views: 170

Answers (5)

Anoushiravan R
Anoushiravan R

Reputation: 21938

Final Update I realized my mistake on trying to find the max value in every row and replace it with desired concatenated string. So I came up with another solution which only replaces the last non-NA value (it can also be not the max values of the row), given all values are not numeric. So here is my final solution:

library(dplyr)
library(stringr)
library(purrr)

df %>%
  pmap_dfr(., ~ {x <- c(...)[-10][!is.na(c(...)[-10])];
  ind <- which(c(...) == x[length(x)]);
  replace(c(...), ind[length(ind)], str_c(..10, x[length(x)], sep = "_"))}
  )

# A tibble: 6 x 10
  Level_1 Level_2 Level_3 Level_4 Level_5   Level_6      Level_7 Level_8      Level_9 Supplier 
  <chr>   <chr>   <chr>   <chr>   <chr>     <chr>        <chr>   <chr>        <chr>   <chr>    
1 1       2       3       4       orioles_8 NA           NA      NA           NA      orioles  
2 1       2       3       4       9         nationals_13 NA      NA           NA      nationals
3 1       2       3       5       10        14           16      dodgers_18   NA      dodgers  
4 1       2       3       5       10        14           17      cardinals_19 NA      cardinals
5 1       2       3       6       giants_11 NA           NA      NA           NA      giants   
6 1       2       3       7       12        padres_15    NA      NA           NA      padres 

Upvotes: 3

AnilGoyal
AnilGoyal

Reputation: 26238

Combined approach with c_across and across

library(tidyverse)
df %>% rowwise() %>%
  mutate(dummy = max(which(!is.na(c_across(starts_with('Level')))))) %>% ungroup() %>%
  mutate(across(starts_with('Level_'), 
                ~ifelse(as.numeric(str_remove(cur_column(), 'Level_')) == dummy, paste(Supplier, ., sep = '_'), .)))

# A tibble: 6 x 11
  Level_1 Level_2 Level_3 Level_4 Level_5   Level_6      Level_7 Level_8      Level_9 Supplier  dummy
    <dbl>   <dbl>   <dbl>   <dbl> <chr>     <chr>          <dbl> <chr>        <lgl>   <chr>     <int>
1       1       2       3       4 orioles_8 NA                NA NA           NA      orioles       5
2       1       2       3       4 9         nationals_13      NA NA           NA      nationals     6
3       1       2       3       5 10        14                16 dodgers_18   NA      dodgers       8
4       1       2       3       5 10        14                17 cardinals_19 NA      cardinals     8
5       1       2       3       6 giants_11 NA                NA NA           NA      giants        5
6       1       2       3       7 12        padres_15         NA NA           NA      padres        6

Combining which strategy used above my friend Anoushiravan's answer can be simplified to:

  • Inside purrr::pmap_dfr do these
    • collect/store number of columns into temp variable n
    • collect/store desired index into temp variable i
    • create a temp variable of length n and having a T at i and F elsewhere
    • use replace to replace ith variable (using temp) with desired values
df %>%
  pmap_dfr(., ~ {n <- ncol(df); i <- max(which(!is.na(c(...)[-n])));
  tmp <- rep(F, n); tmp[i] <- T;
  replace(c(...), tmp, paste(c(...)[n], c(...)[i], sep = '_'))})

# A tibble: 6 x 10
  Level_1 Level_2 Level_3 Level_4 Level_5   Level_6      Level_7 Level_8      Level_9 Supplier 
  <chr>   <chr>   <chr>   <chr>   <chr>     <chr>        <chr>   <chr>        <chr>   <chr>    
1 1       2       3       4       orioles_8 NA           NA      NA           NA      orioles  
2 1       2       3       4       9         nationals_13 NA      NA           NA      nationals
3 1       2       3       5       10        14           16      dodgers_18   NA      dodgers  
4 1       2       3       5       10        14           17      cardinals_19 NA      cardinals
5 1       2       3       6       giants_11 NA           NA      NA           NA      giants   
6 1       2       3       7       12        padres_15    NA      NA           NA      padres  

Upvotes: 2

TarJae
TarJae

Reputation: 79246

Very interesting question. Here is my approach without pivot_longer:

library(dplyr)
# check is na
ind <- !is.na(df1)

# save vector who fullfill assumption value before first NA
values <- as.vector(tapply(df1[ind], row(df1)[ind], tail, 1))

# bind to dataframe
df2 <- cbind(df, values)

# accomplish the task
df2 %>% 
  mutate(across(Level_1:Level_9, ~ case_when(. == values ~ str_c(Supplier ,.),
                                             . != values ~ as.character(.)))) %>% 
  select(-values)

Output:

  Level_1 Level_2 Level_3 Level_4  Level_5     Level_6 Level_7     Level_8 Level_9  Supplier
1       1       2       3       4 orioles8        <NA>    <NA>        <NA>    <NA>   orioles
2       1       2       3       4        9 nationals13    <NA>        <NA>    <NA> nationals
3       1       2       3       5       10          14      16   dodgers18    <NA>   dodgers
4       1       2       3       5       10          14      17 cardinals19    <NA> cardinals
5       1       2       3       6 giants11        <NA>    <NA>        <NA>    <NA>    giants
6       1       2       3       7       12    padres15    <NA>        <NA>    <NA>    padres

Upvotes: 2

Karthik S
Karthik S

Reputation: 11546

Another approach:

library(tidyr)
library(dplyr)

df %>% mutate(across(contains('Level'), ~ as.character(.))) %>% 
   mutate(across(contains('Level'), ~ coalesce(., Supplier))) %>% select(-Supplier) %>% 
   mutate(ID = row_number()) %>% 
     pivot_longer(cols = -ID) %>% group_by(ID) %>% 
       mutate(value = case_when(duplicated(value) ~ NA_character_, TRUE ~ value)) %>% pivot_wider(names_from = name, values_from = value) %>% 
   ungroup() %>% select(-ID)
# A tibble: 6 x 9
  Level_1 Level_2 Level_3 Level_4 Level_5 Level_6 Level_7   Level_8 Level_9  
  <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <chr>   <chr>    
1 1       2       3       4       8       orioles NA        NA      NA       
2 1       2       3       4       9       13      nationals NA      NA       
3 1       2       3       5       10      14      16        18      dodgers  
4 1       2       3       5       10      14      17        19      cardinals
5 1       2       3       6       11      giants  NA        NA      NA       
6 1       2       3       7       12      15      padres    NA      NA       

Upvotes: 2

Martin Gal
Martin Gal

Reputation: 16998

To be honest, I'm not 100% sure about your desired output. Using dplyr and tidyr:

library(tidyr)
library(dplyr)

sample %>%
  pivot_longer(cols=starts_with("Level_"), names_prefix="Level_", names_to="level") %>%
  drop_na() %>%
  group_by(Supplier) %>%
  mutate(new_val=ifelse(level==max(level), paste0(Supplier, "<br>", value), value)) %>%
  select(-value) %>%
  pivot_wider(names_from=level, names_prefix="Level_", values_from=new_val)

returns

# A tibble: 6 x 9
# Groups:   Supplier [6]
  Supplier  Level_1 Level_2 Level_3 Level_4 Level_5      Level_6         Level_7 Level_8        
  <chr>     <chr>   <chr>   <chr>   <chr>   <chr>        <chr>           <chr>   <chr>          
1 orioles   1       2       3       4       orioles<br>8 NA              NA      NA             
2 nationals 1       2       3       4       9            nationals<br>13 NA      NA             
3 dodgers   1       2       3       5       10           14              16      dodgers<br>18  
4 cardinals 1       2       3       5       10           14              17      cardinals<br>19
5 giants    1       2       3       6       giants<br>11 NA              NA      NA             
6 padres    1       2       3       7       12           padres<br>15    NA      NA  

I lost the Level_9 column since it contained only NA. You can easily add it again.

Upvotes: 3

Related Questions