Reputation: 469
I'd like to create a new column in R that concatenates several strings based on whether or not several columns are marked 'X'.
Here is the data I have:
Column1 Column2 Column3 Column4
X       X       X
X               X       X
X                       X
I'd like to create a new Column5 that will include each of the following if there was an 'X' entered:
Column1: 'Texas'
Column2: 'California'
Column3: 'New Jersey'
Column4: 'Oklahoma'
I'm able to do this with quite a bit of code in R, but I think there is a more concise way of doing it with dplyr.
Upvotes: 1
Views: 2098
Reputation: 16862
You may need to tweak a few things depending on your data types. I pasted in what you have here, where columns without checkmarks are just blank.
The method I used is to create row numbers to identify the observations you start out with, convert to long-shaped data, group by row number, find the states that are checked off, collapse them into one string, and reshape back to a wide format. The reason for doing it this way is that it scales well: it doesn't matter how many states there are, because I'm not writing something like Texas == "X" & California == "X" & ..., which would require hardcoding.
The first major step is using tidyr::gather to get one row per combination of row number and state, along with its checkmark or blank.
library(tidyverse)
df <- "Column1 Column2 Column3 Column4
X       X       X
X               X       X
X                       X" %>% read_table()
df %>%
setNames(c("Texas", "California", "New Jersey", "Oklahoma")) %>%
mutate(row = row_number()) %>%
gather(key = state, value = value, -row)
#> # A tibble: 12 x 3
#> row state value
#> <int> <chr> <chr>
#> 1 1 Texas X
#> 2 2 Texas X
#> 3 3 Texas X
#> 4 1 California X
#> 5 2 California ""
#> 6 3 California ""
#> 7 1 New Jersey X
#> 8 2 New Jersey X
#> 9 3 New Jersey ""
#> 10 1 Oklahoma ""
#> 11 2 Oklahoma X
#> 12 3 Oklahoma X
Then I group by the row numbers and use a stringr convenience function. str_which(value, "^X$") finds the positions where value matches the regex ^X$. Using these positions as indices into state gets the entries in state that correspond to an X in value. Then I collapse those strings into a single string column, and use tidyr::spread to reshape back into a wide format.
df %>%
setNames(c("Texas", "California", "New Jersey", "Oklahoma")) %>%
mutate(row = row_number()) %>%
gather(key = state, value = value, -row) %>%
group_by(row) %>%
mutate(states = state[str_which(value, "^X$")] %>% paste(collapse = ", ")) %>%
spread(key = state, value = value)
#> # A tibble: 3 x 6
#> # Groups: row [3]
#> row states California `New Jersey` Oklahoma Texas
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Texas, California, New Jer… X X "" X
#> 2 2 Texas, New Jersey, Oklahoma "" X X X
#> 3 3 Texas, Oklahoma "" "" X X
Created on 2018-10-11 by the reprex package (v0.2.1)
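For readers on newer tidyr versions, where gather and spread are superseded, the same reshape-and-collapse idea can be sketched with pivot_longer. This is only a sketch under assumptions: it needs tidyr 1.0.0 or later, the names df_states, states_by_row, and result are made up for illustration, and the input is built by hand with "" for blank cells rather than read from the pasted text.

```r
library(dplyr)
library(tidyr)

# Assumed input: one column per state, "X" where checked and "" where blank
df_states <- tibble(
  Texas        = c("X", "X", "X"),
  California   = c("X", "",  ""),
  `New Jersey` = c("X", "X", ""),
  Oklahoma     = c("",  "X", "X")
) %>%
  mutate(row = row_number())

# Long format, keep only checked cells, then one collapsed string per row
states_by_row <- df_states %>%
  pivot_longer(cols = -row, names_to = "state", values_to = "value") %>%
  filter(value == "X") %>%
  group_by(row) %>%
  summarise(states = paste(state, collapse = ", "))

# Join the collapsed strings back onto the original wide data
result <- left_join(df_states, states_by_row, by = "row")
```

Joining back instead of reshaping to wide again keeps the original column order. A row with no X at all would end up with NA in states, which may or may not be what you want.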
Upvotes: 1
Reputation: 12165
library(dplyr)
df <- data.frame(c1 = c(TRUE, TRUE, TRUE),
                 c2 = c(TRUE, FALSE, FALSE),
                 c3 = c(TRUE, TRUE, FALSE),
                 c4 = c(FALSE, TRUE, TRUE))
A vector with the state names in the same order as the columns they correspond to:
sts = c('Texas', 'California', 'New Jersey', "Oklahoma")
Now you can test each column to get the indices of the TRUE columns, then grab the corresponding states from the sts vector and paste them together.
In the example above, the data frame contains TRUE and FALSE, but if you want to use a character value (for example 'X') to select cells, just change the test in the which statement from == TRUE to == 'X'.
Note that this currently requires you to specify the column names. (The plus side is that it won't cause problems if you have additional columns that you don't want to consider.)
df %>%
rowwise() %>%
mutate(c5 = paste0(sts[which(c(c1,c2,c3,c4) == TRUE)], collapse = ', '))
Source: local data frame [3 x 5]
Groups: <by row>
# A tibble: 3 x 5
c1 c2 c3 c4 c5
<lgl> <lgl> <lgl> <lgl> <chr>
1 TRUE TRUE TRUE FALSE Texas, California, New Jersey
2 TRUE FALSE TRUE TRUE Texas, New Jersey, Oklahoma
3 TRUE FALSE FALSE TRUE Texas, Oklahoma
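As a rough illustration of the character-value variant mentioned above, here is the same rowwise approach with the test switched to == 'X'. The data frame df_x and the result name result_x are invented for this sketch, and blank cells are assumed to be empty strings.

```r
library(dplyr)

# Assumed data: 'X' marks a selected state, "" marks a blank cell
df_x <- data.frame(c1 = c("X", "X", "X"),
                   c2 = c("X", "",  ""),
                   c3 = c("X", "X", ""),
                   c4 = c("",  "X", "X"),
                   stringsAsFactors = FALSE)

sts <- c("Texas", "California", "New Jersey", "Oklahoma")

result_x <- df_x %>%
  rowwise() %>%
  # Same logic as before; only the comparison value changes
  mutate(c5 = paste0(sts[which(c(c1, c2, c3, c4) == "X")], collapse = ", ")) %>%
  ungroup()
```

The ungroup() at the end is optional; it just drops the rowwise grouping so later steps operate on the whole data frame again.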
Upvotes: 1
Reputation: 6776
Here is one approach that might be viable:
library(dplyr)  # provides tibble() and %>%
df = tibble(c1='x', c2=c('x', NA, NA), c3=c('x', 'x', NA), c4=c(NA, 'x', 'x'))
values = c('Texas', 'California', 'New Jersey', 'Oklahoma')
df$c5 = sapply(df, function(x) !is.na(x)) %>%
apply(MARGIN=1, FUN=function(x) paste(values[x], collapse=', '))
df
# A tibble: 3 x 5
c1 c2 c3 c4 c5
<chr> <chr> <chr> <chr> <chr>
1 x x x NA Texas, California, New Jersey
2 x NA x x Texas, New Jersey, Oklahoma
3 x NA NA x Texas, Oklahoma
The sapply loops through the columns of the data frame, checking whether each value is missing, to produce a matrix of TRUE/FALSE values. apply then loops over that matrix row by row, passing each row of TRUE/FALSE values into an anonymous function that indexes values and pastes the results. The output from the chained sapply and apply calls is a character vector of the strings you're looking for, equal in length to the number of rows in df, so you can just assign it as your new column. Hope that makes sense.
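If the unmarked cells might be empty strings rather than NA (as in the question's pasted data), the logical matrix can be built to cover both cases. This is a minimal base R sketch under that assumption; df_blank, marked, and c5 are hypothetical names used only here.

```r
# Hypothetical data mixing NA and "" as "not marked"
df_blank <- data.frame(c1 = c("x", "x", "x"),
                       c2 = c("x", NA,  ""),
                       c3 = c("x", "x", NA),
                       c4 = c(NA,  "x", "x"),
                       stringsAsFactors = FALSE)

values <- c("Texas", "California", "New Jersey", "Oklahoma")

# Character matrix of the cells, then TRUE where a cell is neither NA nor ""
m <- as.matrix(df_blank)
marked <- !is.na(m) & m != ""

# For each row, pick the state names at the TRUE positions and collapse them
c5 <- apply(marked, 1, function(row) paste(values[row], collapse = ", "))
```

Because !is.na(m) is FALSE wherever m is NA, the & never yields NA, so marked is a clean TRUE/FALSE matrix that is safe to index with.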
Upvotes: 1