Rick Pack
Rick Pack

Reputation: 1074

Concatenate data.frame character columns based on column index stored as a vector in R w/ dplyr mutate()?

Can one concatenate data frame columns based on the column index stored as a vector, possibly using a stringr function (eg., str_c() ) or paste? And then, how can I do so to create a new variable using dplyr mutate()?

I would like to do so for an email aggregation task that will mean reading Excel worksheets containing varying name columns (e.g., just one Name column and sometimes First and Last Names), using the readxl package.

For example: Say I wanted to concatenate all of the column names containing "color" in the starwars data frame installed with the tidyverse R package.

 library(tidyverse)
 colnames(starwars)
 [1] "name"       "height"     "mass"       "hair_color" "skin_color" "eye_color"  "birth_year" "gender"     "homeworld"  "species"    "films"     
 [12] "vehicles"   "starships" 

I could capture the index of those columns with:

color_vec_num <- which(grepl("COLOR", toupper(colnames(starwars))))
print(color_vec_num)
[1] 4 5 6

What are the column names?

colnames(starwars)[color_vec_num]
[1] "hair_color" "skin_color" "eye_color"

I can link together those column names easily enough by name. Looking at the first 4 rows of starwars:

str_c(starwars$hair_color, starwars$skin_color, starwars$eye_color)[1:4]
[1] "blondfairblue"   NA                NA                "nonewhiteyellow"

However, my attempts at using the column index have failed.

Edit per Akrun's discovery of a typo Borrowing this SO use of double brackets to create (mutate) a new variable works, but this identifies columns singularly and explicitly rather than using a numeric vector to identify each column index.

starwars %>% mutate(newcolor = paste(.[[4]],.[[5]], .[[6]])) %>% select(name, newcolor)
# A tibble: 87 x 2
   name               newcolor                    
   <chr>              <chr>                       
 1 Luke Skywalker     blond fair blue             
 2 C-3PO              NA gold yellow              
 3 R2-D2              NA white, blue red   

This attempt, reminiscent of a previous SO answer for a different question, obviously doesn't look right.

paste(starwars %>% select(color_vec_num), collapse = " ", stringsAsFactors = FALSE)
[1] "c(\"blond\", NA, NA, \"none\", \"brown\", \"brown, grey\", \"brown\", NA, \"black\", \"auburn, white\", \"blond\", \"auburn, grey\", \"brown\", \"brown\", NA, NA,

Upvotes: 3

Views: 1052

Answers (1)

akrun
akrun

Reputation: 887158

We can subset the columns with the column index (.[color_vec_num]) and reduce it to a single column/vector in mutate by concatenating (str_c)

library(tidyverse)
out <- starwars %>%
           mutate(newcolor = reduce(.[color_vec_num], str_c))
out %>% 
   pull(newcolor) %>% 
   head(4)
#[1] "blondfairblue"   NA                NA                "nonewhiteyellow"

Regarding the error in OP's paste

paste(.[[4]],.[[5]], [[6]]

the last one is missing . i.e. it should be

starwars %>%
   mutate(newcolor = paste(.[[4]],.[[5]], .[[6]]))

Having said that, there is a distinct advantage in using str_c when compared to paste or unite (from tidyr) with regards to missing values. With str_c, any missing value in a column returns NA for the whole row while paste or unite converts the NA to character and returns something like

"NA gold yellow" 

for the second element while it is NA for str_c

Upvotes: 2

Related Questions