andrew_reece

Reputation: 21274

Create new columns to indicate column name's position inside another string vector (with dplyr, purrr, and stringr)

Given this example data:

require(stringr)
require(tidyverse)

labels <- c("foo", "bar", "baz")
n_rows <- 4

df <- 1:n_rows %>%
  map(~ data.frame(
      block_order=paste(sample(labels, size=length(labels), replace=FALSE),
                        collapse="|"))) %>%
  bind_rows()

df
  block_order
1 foo|bar|baz
2 baz|bar|foo
3 foo|baz|bar
4 foo|bar|baz

I want to generate a column for each string in labels, which takes the value of the position of that string in the |-separated sequence in each row.

Desired output:

  block_order foo bar baz
1 foo|bar|baz   1   2   3
2 baz|bar|foo   3   2   1
3 foo|baz|bar   1   3   2
4 foo|bar|baz   1   2   3

I've been trying different variations in a dplyr/purrr setup, like this example, where I map in each value of label, and then attempt to get its position in block_order using match on str_split:

labels %>%
  map(~ df %>%
        transmute(!!.x := match(!!.x, str_split(block_order, 
                                                "\\|", 
                                                simplify=TRUE)))) %>%
  bind_cols(df, .)

But that produces unexpected output:

  block_order foo bar baz
1 foo|bar|baz   1   5   2
2 baz|bar|foo   1   5   2
3 foo|baz|bar   1   5   2
4 foo|bar|baz   1   5   2

I'm not really sure what these numbers represent, or why they're all the same.

If anyone can help me figure out (a) how to achieve my desired output in a dplyr/purrr framework and (b) why the proposed solution here gives the output it does, I'd be very appreciative.

Upvotes: 1

Views: 186

Answers (3)

alistaire

Reputation: 43354

Unless you need to for other reasons, you don't have to fully split the string if you just identify the location of the first match for each value of labels, which regexpr will give you. mapping over labels will give a list with one element for each string in labels (so it's a quick iteration), which you can then pmap rank over to get indices. Using the *_dfr version to simplify the results to a data frame and cbinding to the original,

library(tidyverse)
set.seed(47)

labels <- c("foo", "bar", "baz")
df <- tibble(block_order = replicate(10, paste(sample(labels), collapse = "|")))

labels %>% 
    map(~regexpr(.x, df$block_order)) %>% 
    pmap_dfr(~set_names(as.list(rank(c(...))), labels)) %>% 
    bind_cols(df, .)
#> # A tibble: 10 x 4
#>    block_order   foo   bar   baz
#>    <chr>       <dbl> <dbl> <dbl>
#>  1 baz|foo|bar    2.    3.    1.
#>  2 baz|bar|foo    3.    2.    1.
#>  3 bar|foo|baz    2.    1.    3.
#>  4 baz|foo|bar    2.    3.    1.
#>  5 foo|bar|baz    1.    2.    3.
#>  6 baz|foo|bar    2.    3.    1.
#>  7 foo|baz|bar    1.    3.    2.
#>  8 bar|baz|foo    3.    1.    2.
#>  9 baz|foo|bar    2.    3.    1.
#> 10 foo|bar|baz    1.    2.    3.

If you prefer stringr/stringi to base regex, you could do the same thing by changing the regexpr call to str_locate(df$block_order, .x)[, "start"] or stringi::stri_locate_first_fixed in the same arrangement.
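
As a minimal sketch of the stringr variant just described (the three-row df here is a made-up example, and wrapping the pattern in fixed() is my own addition so each label is matched literally rather than as a regex):

```r
library(tidyverse)

labels <- c("foo", "bar", "baz")
# Hypothetical example data; any data frame with a block_order column would do
df <- tibble(block_order = c("foo|bar|baz", "baz|bar|foo", "bar|baz|foo"))

positions <- labels %>%
  # character index at which each label starts, one vector per label
  map(~ str_locate(df$block_order, fixed(.x))[, "start"]) %>%
  # rank the start indices within each row to turn them into 1, 2, 3
  pmap_dfr(~ set_names(as.list(rank(c(...))), labels))

bind_cols(df, positions)
```

Note that this approach, like the regexpr one, assumes no label is a substring of another (for example "ba" and "bar"), since the first match would then be ambiguous.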

Upvotes: 4

akrun

Reputation: 887611

We can split the 'block_order' by |, loop through the list of vectors using lapply, get the index with match, rbind the vectors and assign it to create new columns

labels <- c("foo", "bar", "baz")
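# match(labels, x) gives the position of each label within the split row;
# match(x, labels) is the inverse mapping and differs for rows such as "bar|baz|foo"
df[labels] <- do.call(rbind, lapply(strsplit(df$block_order, "|",
         fixed = TRUE), function(x) match(labels, x)))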

Or similar idea with tidyverse

library(tidyverse)
str_split(df$block_order, "[|]") %>%
       # position of each label within the split row
       map(~ match(labels, .x)) %>% 
      do.call(rbind, .) %>% 
      as_tibble %>% 
      set_names(labels) %>%
      bind_cols(df, .)
#   block_order foo bar baz
#1 foo|bar|baz   1   2   3
#2 baz|bar|foo   3   2   1
#3 foo|baz|bar   1   3   2
#4 foo|bar|baz   1   2   3

Another option would be to use separate_rows, reshape it to 'long' format and spread it back

rownames_to_column(df, 'rn') %>%
    separate_rows(block_order) %>% 
    group_by(rn) %>% 
    mutate(ind = match(labels, block_order), labels = factor(labels, levels = labels)) %>%
    select(-block_order) %>%
    spread(labels, ind) %>% 
    ungroup %>%
    select(-rn) %>% 
    bind_cols(df, .)
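
On part (b) of the question, a likely explanation, sketched here in base R: str_split(..., simplify = TRUE) returns one character matrix for the whole column, and match() treats that matrix as a single column-major vector. Each label is therefore looked up once across all rows, and the single result is recycled down the new column. The matrix m below is a base R stand-in for the stringr result, built from the four rows shown in the question.

```r
labels <- c("foo", "bar", "baz")
block_order <- c("foo|bar|baz", "baz|bar|foo", "foo|baz|bar", "foo|bar|baz")

# Stand-in for str_split(block_order, "\\|", simplify = TRUE): a 4 x 3 character matrix
m <- do.call(rbind, strsplit(block_order, "|", fixed = TRUE))

# Column-major view that match() effectively searches:
# column 1 (foo baz foo foo), then column 2 (bar bar baz bar), then column 3
as.vector(m)

# One lookup over the whole matrix: the first "foo" is element 1,
# the first "bar" is element 5, the first "baz" is element 2
whole <- match(labels, m)

# Looking each label up row by row gives the intended positions instead
per_row <- t(apply(m, 1, function(x) match(labels, x)))
colnames(per_row) <- labels
per_row
```

Here whole is 1 5 2, the same numbers that fill every row of the unexpected output, while per_row reproduces the desired table.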

Upvotes: 5

Nick DiQuattro

Reputation: 739

I think this might work:

library(tidyr)
library(purrr)
position_counter <- function(...) {
  row = list(...)
  row %>% map(~which(row == .)) %>% setNames(row)
}

df %>%
  separate(block_order, labels) %>% 
  pmap_df(position_counter)

Upvotes: 1
