Reputation: 1074
Can one concatenate data frame columns based on the column index stored as a vector, possibly using a stringr function (eg., str_c() ) or paste? And then, how can I do so to create a new variable using dplyr mutate()?
I would like to do so for an email aggregation task that will mean reading Excel worksheets containing varying name columns (e.g., just one Name column and sometimes First and Last Names), using the readxl package.
For example: Say I wanted to concatenate all of the column names containing "color" in the starwars data frame installed with the tidyverse R package.
library(tidyverse)
colnames(starwars)
[1] "name" "height" "mass" "hair_color" "skin_color" "eye_color" "birth_year" "gender" "homeworld" "species" "films"
[12] "vehicles" "starships"
I could capture the index of those columns with:
color_vec_num <- which(grepl("COLOR", toupper(colnames(starwars))))
print(color_vec_num)
[1] 4 5 6
What are the column names?
colnames(starwars)[color_vec_num]
[1] "hair_color" "skin_color" "eye_color"
I can link together those column names easily enough by name. Looking at the first 4 rows of starwars:
str_c(starwars$hair_color, starwars$skin_color, starwars$eye_color)[1:4]
[1] "blondfairblue" NA NA "nonewhiteyellow"
However, my attempts at using the column index have failed.
Edit per Akrun's discovery of a typo Borrowing this SO use of double brackets to create (mutate) a new variable works, but this identifies columns singularly and explicitly rather than using a numeric vector to identify each column index.
starwars %>% mutate(newcolor = paste(.[[4]],.[[5]], .[[6]])) %>% select(name, newcolor)
# A tibble: 87 x 2
name newcolor
<chr> <chr>
1 Luke Skywalker blond fair blue
2 C-3PO NA gold yellow
3 R2-D2 NA white, blue red
This attempt, reminiscent of a previous SO answer for a different question, obviously doesn't look right.
paste(starwars %>% select(color_vec_num), collapse = " ", stringsAsFactors = FALSE)
[1] "c(\"blond\", NA, NA, \"none\", \"brown\", \"brown, grey\", \"brown\", NA, \"black\", \"auburn, white\", \"blond\", \"auburn, grey\", \"brown\", \"brown\", NA, NA,
Upvotes: 3
Views: 1052
Reputation: 887158
We can subset the columns with the column index (.[color_vec_num]
) and reduce
it to a single column/vector in mutate
by concatenating (str_c
)
library(tidyverse)
out <- starwars %>%
mutate(newcolor = reduce(.[color_vec_num], str_c))
out %>%
pull(newcolor) %>%
head(4)
#[1] "blondfairblue" NA NA "nonewhiteyellow"
Regarding the error in OP's paste
paste(.[[4]],.[[5]], [[6]]
the last one is missing .
i.e. it should be
starwars %>%
mutate(newcolor = paste(.[[4]],.[[5]], .[[6]]))
Having said that, there is a distinct advantage in using str_c
when compared to paste
or unite
(from tidyr
) with regards to missing values. With str_c
, any missing value in a column returns NA
for the whole row while paste
or unite
converts the NA
to character and returns something like
"NA gold yellow"
for the second element while it is NA
for str_c
Upvotes: 2