statsguyz

Reputation: 469

Using dplyr to create a list of strings

I'd like to create a new column in R that concatenates several strings based on whether or not several columns are marked 'X'.

Here is the data I have:

Column1   Column2   Column3   Column4
      X         X         X         
      X                   X         X
      X                             X

I'd like to create a new Column5 that will include each of the following if there was an 'X' entered:

Column1: 'Texas'
Column2: 'California'
Column3: 'New Jersey'
Column4: 'Oklahoma'

I'm able to do this with quite a bit of code in R, but I think there is a more concise way of doing it with dplyr.

Upvotes: 1

Views: 2098

Answers (3)

camille

Reputation: 16862

There may be a little you need to tweak based on data types—I pasted in what you have here, which is that columns without checkmarks are just blank.

The method that I used is to create row numbers to identify the observations that you start out with, convert to long-shaped data, group by row number, find states that are checked off, collapse them into one string, and reshape back to a wide format. The reason for doing it this way is that it will scale well—it doesn't matter how many states there are, because I'm not doing something like Texas == "X" & California == "X" & ... that would require hardcoding.

The first major step is using tidyr::gather so you have rows, all possible values of states, and the checkmarks or blanks.

library(tidyverse)

df <- "Column1   Column2   Column3   Column4
      X         X         X         
      X                   X         X
      X                             X" %>% read_table()

df %>%
  setNames(c("Texas", "California", "New Jersey", "Oklahoma")) %>%
  mutate(row = row_number()) %>%
  gather(key = state, value = value, -row)
#> # A tibble: 12 x 3
#>      row state      value
#>    <int> <chr>      <chr>
#>  1     1 Texas      X    
#>  2     2 Texas      X    
#>  3     3 Texas      X    
#>  4     1 California X    
#>  5     2 California ""   
#>  6     3 California ""   
#>  7     1 New Jersey X    
#>  8     2 New Jersey X    
#>  9     3 New Jersey ""   
#> 10     1 Oklahoma   ""   
#> 11     2 Oklahoma   X    
#> 12     3 Oklahoma   X

Then I group by the row numbers and use a stringr convenience function: str_which(value, "^X$") returns the positions where value matches the regex ^X$, that is, where the cell is exactly X. Using these positions as indices into state gives the states that correspond to an X in value. I then collapse those names into a single string column and use tidyr::spread to get back to a wide format.

df %>%
  setNames(c("Texas", "California", "New Jersey", "Oklahoma")) %>%
  mutate(row = row_number()) %>%
  gather(key = state, value = value, -row) %>%
  group_by(row) %>%
  mutate(states = state[str_which(value, "^X$")] %>% paste(collapse = ", ")) %>%
  spread(key = state, value = value)
#> # A tibble: 3 x 6
#> # Groups:   row [3]
#>     row states                      California `New Jersey` Oklahoma Texas
#>   <int> <chr>                       <chr>      <chr>        <chr>    <chr>
#> 1     1 Texas, California, New Jer… X          X            ""       X    
#> 2     2 Texas, New Jersey, Oklahoma ""         X            X        X    
#> 3     3 Texas, Oklahoma             ""         ""           X        X

Created on 2018-10-11 by the reprex package (v0.2.1)
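For readers on tidyr 1.0 or later, where gather and spread are superseded, the same long-then-collapse idea can be sketched with pivot_longer. This is only an illustration, not part of the answer above: the tibble is a hand-built stand-in for the question's data (blanks stored as ""), and the names states and result are made up for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of the question's data; blank cells are "".
df <- tibble(
  Texas        = c("X", "X", "X"),
  California   = c("X", "",  ""),
  `New Jersey` = c("X", "X", ""),
  Oklahoma     = c("",  "X", "X")
)

# Long format: one row per (row, state) pair, keeping only the marked cells,
# then one collapsed string per original row.
states <- df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "state", values_to = "value") %>%
  filter(value == "X") %>%
  group_by(row) %>%
  summarise(states = paste(state, collapse = ", "))

# Join the collapsed strings back onto the original wide data.
result <- df %>%
  mutate(row = row_number()) %>%
  left_join(states, by = "row")

result$states
```

Because the join is a left join, a row with no X at all would end up with NA in states rather than an empty string.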

Upvotes: 1

divibisan

Reputation: 12165

df <- data.frame(c1 = c(TRUE, TRUE, TRUE),
                 c2 = c(TRUE, FALSE, FALSE),
                 c3 = c(TRUE, TRUE, FALSE),
                 c4 = c(FALSE, TRUE, TRUE))

A vector with the state names in the same order as the columns they correspond to:

sts = c('Texas', 'California', 'New Jersey', "Oklahoma")

Now you can test each column to get the indices of TRUE columns, then grab the corresponding states from sts vector and paste them together.

In the example above, the data frame contains TRUE and FALSE, but if you want to use a character value (for example 'X') to select cells, just change the test in the which statement from == TRUE to == 'X', for example.

Note that this currently requires you to specify the column names. (The plus side of this is that it won't have any problems if you have additional columns that you don't want to consider)

library(dplyr)
# Evaluate row by row and keep the state names whose column is TRUE
df %>%
    rowwise() %>%
    mutate(c5 = paste0(sts[which(c(c1,c2,c3,c4) == TRUE)], collapse = ', '))

Source: local data frame [3 x 5]
Groups: <by row>

# A tibble: 3 x 5
  c1    c2    c3    c4    c5                           
  <lgl> <lgl> <lgl> <lgl> <chr>                        
1 TRUE  TRUE  TRUE  FALSE Texas, California, New Jersey
2 TRUE  FALSE TRUE  TRUE  Texas, New Jersey, Oklahoma  
3 TRUE  FALSE FALSE TRUE  Texas, Oklahoma      
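The character-value variant mentioned in this answer might look like the sketch below. It assumes the cells hold 'X' or an empty string, as in the question; df_chr and out are names made up for the illustration.

```r
library(dplyr)

# Assumed layout: 'X' marks a selected state, "" means not selected.
df_chr <- data.frame(c1 = c("X", "X", "X"),
                     c2 = c("X", "",  ""),
                     c3 = c("X", "X", ""),
                     c4 = c("",  "X", "X"),
                     stringsAsFactors = FALSE)

sts <- c("Texas", "California", "New Jersey", "Oklahoma")

out <- df_chr %>%
  rowwise() %>%
  # Same test as before, comparing against 'X' instead of TRUE;
  # which() also drops any NA comparisons.
  mutate(c5 = paste0(sts[which(c(c1, c2, c3, c4) == "X")], collapse = ", ")) %>%
  ungroup()

out$c5
```

The ungroup() at the end removes the row-wise grouping so that later verbs operate on the whole table again.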

Upvotes: 1

tblznbits

Reputation: 6776

Here is one approach that might be viable:

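library(tidyverse)

# Blank cells are stored as NA; c1 = 'x' is recycled to all three rows
df = tibble(c1='x', c2=c('x', NA, NA), c3=c('x', 'x', NA), c4=c(NA, 'x', 'x'))
# State names in the same order as the columns
values = c('Texas', 'California', 'New Jersey', 'Oklahoma')
df$c5 = sapply(df, function(x) !is.na(x)) %>% 
    apply(MARGIN=1, FUN=function(x) paste(values[x], collapse=', '))
df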

# A tibble: 3 x 5
  c1    c2    c3    c4    c5                           
  <chr> <chr> <chr> <chr> <chr>                        
1 x     x     x     NA    Texas, California, New Jersey
2 x     NA    x     x     Texas, New Jersey, Oklahoma  
3 x     NA    NA    x     Texas, Oklahoma 

The sapply loops through the dataframe checking if the value is missing or not in order to get a matrix of TRUE/FALSE values. That matrix is then looped over, passing the row of T/F values into an anonymous function that indexes values and pastes the results. The output from the chained sapply and apply functions is a vector of the strings you're looking for equal in length to the number of rows in df. So you can just set this as your new column. Hope that makes sense.
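To make the two steps described above easier to follow, here is a rough base R sketch that names the intermediate logical matrix. It assumes, as in this answer, that unmarked cells are NA; the names marked and c5 are invented for the illustration, and as.matrix stands in for the sapply call.

```r
# Assumed data: 'x' marks a selected state, NA means not selected.
df <- data.frame(c1 = c("x", "x", "x"),
                 c2 = c("x", NA,  NA),
                 c3 = c("x", "x", NA),
                 c4 = c(NA,  "x", "x"),
                 stringsAsFactors = FALSE)

values <- c("Texas", "California", "New Jersey", "Oklahoma")

# Step 1: a 3 x 4 logical matrix, TRUE wherever a cell is not missing.
marked <- !is.na(as.matrix(df))

# Step 2: for each row, use its TRUE/FALSE flags to index values and
# paste the selected names into one string.
c5 <- apply(marked, 1, function(flags) paste(values[flags], collapse = ", "))

c5
```

The result is a character vector with one element per row of df, so it can be assigned directly as a new column.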

Upvotes: 1
