C.Robin
C.Robin

Reputation: 1102

Extract part of string values, make new column names, and make dataframe wide

I want to extract the last part of a string column (always wrapped in square brackets), make these the names of new columns, and then reshape the data from long to wide and fill the new columns with these values.

For example, if I have this dataframe:

whatihave <- data_frame(v1 = c('abc [effort]', 'def [effort]', 'ghi [effort]', 'abc [scope]', 'def [scope]', 'ghi [scope]'),
                        scores = c(1:6))

# A tibble: 6 x 2
  v1           scores
  <chr>         <int>
1 abc [effort]      1
2 def [effort]      2
3 ghi [effort]      3
4 abc [scope]       4
5 def [scope]       5
6 ghi [scope]       6

I want to transform it into this dataframe:

whatiwant <- data_frame(v1 = c('abc', 'def', 'ghi'), 
                        effort = c(1, 2, 3),
                        scope = c(4, 5, 6))

  v1    effort scope
  <chr>  <dbl> <dbl>
1 abc        1     4
2 def        2     5
3 ghi        3     6

As you can see, the characters inside the square brackets at the end of the values in the v1 column have become the names of two new variables (effort and scope). The values from the scores column have then filled the new columns i've made.

How do I do this?

EDIT BELOW

The toy data I mocked up was missing a key feature of my actual data. There are actually multiple instances of the same v1 value and with the same suffix in square brackets.

As a result the answer below (which is otherwise excellent) results in lists of cols in each cell of effort and scope instead of individual values.

Let's imagine I have this data instead:

whatiactuallyhave <- data_frame(v1 = c('abc [effort]', 'abc [effort]', 'def [effort]', 'def [effort]', 'ghi [effort]', 'abc [scope]', 'abc [scope]', 'def [scope]', 'ghi [scope]', 'ghi [scope]'), 
                        scores = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10'))

# A tibble: 10 x 2
   v1           scores
   <chr>        <chr> 
 1 abc [effort] 1     
 2 abc [effort] 2     
 3 def [effort] 3     
 4 def [effort] 4     
 5 ghi [effort] 5     
 6 abc [scope]  6     
 7 abc [scope]  7     
 8 def [scope]  8     
 9 ghi [scope]  9     
10 ghi [scope]  10    

And that I want to transform it to this:

whatiactuallywant <- data_frame(v1 = c('abc', 'abc', 'def', 'def', 'ghi', 'ghi'), 
                        effort = c('1', '2', '3', '4', '5', 'NA'),
                        scope = c('6', '7', '8', 'NA', '9', '10'))

# A tibble: 6 x 3
  v1    effort scope
  <chr> <chr>  <chr>
1 abc   1      6    
2 abc   2      7    
3 def   3      8    
4 def   4      NA   
5 ghi   5      9    
6 ghi   NA     10   

I hope that's clearer now! Thanks so much for your help.

Upvotes: 2

Views: 1213

Answers (2)

Anoushiravan R
Anoushiravan R

Reputation: 21918

This solution is very similar to the one posted by my dear friend @AnilGoyal, as only the regex is different:

library(dplyr)
library(stingr)
library(tidyr)

whatiactuallyhave %>% 
  extract(v1, c("v1", "v2"), "([a-z]+) ([[a-z]+])") %>%
  mutate(v2 = str_replace(v2, "\\[(\\w+)\\]", "\\1")) %>%
  group_by(v1, v2) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = v2, values_from = scores) %>%
  select(-id)

# A tibble: 6 x 3
# Groups:   v1 [3]
  v1    effort scope
  <chr> <chr>  <chr>
1 abc   1      6    
2 abc   2      7    
3 def   3      8    
4 def   4      NA   
5 ghi   5      9    
6 ghi   NA     10  

Upvotes: 2

AnilGoyal
AnilGoyal

Reputation: 26218

Revised scenario

  • Using tidyr::extract will lead you saving one extra step of mutate as you can directly extract two desired strings into two columns using regex here.
library(tidyverse)
whatiactuallyhave <- data_frame(v1 = c('abc [effort]', 'abc [effort]', 'def [effort]', 'def [effort]', 'ghi [effort]', 'abc [scope]', 'abc [scope]', 'def [scope]', 'ghi [scope]', 'ghi [scope]'), 
                                scores = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10'))
#> Warning: `data_frame()` was deprecated in tibble 1.1.0.
#> Please use `tibble()` instead.

whatiactuallyhave %>%
  tidyr::extract(v1, into = c('v1', 'name'), regex = '(\\w+)\\s\\[(\\w+)\\]') %>%
  group_by(v1, name) %>%
  mutate(d = row_number()) %>%
  pivot_wider(names_from = name, values_from = scores, values_fill = NA) %>%
  select(-d)

#> # A tibble: 6 x 3
#> # Groups:   v1 [3]
#>   v1    effort scope
#>   <chr> <chr>  <chr>
#> 1 abc   1      6    
#> 2 abc   2      7    
#> 3 def   3      8    
#> 4 def   4      <NA> 
#> 5 ghi   5      9    
#> 6 ghi   <NA>   10

Created on 2021-05-26 by the reprex package (v2.0.0)


Earlier answer

whatihave <- data_frame(v1 = c('abc [effort]', 'def [effort]', 'ghi [effort]', 'abc [scope]', 'def [scope]', 'ghi [scope]'),
                        scores = c(1:6))

library(tidyverse)
whatihave %>%
  separate(v1, into = c('v1', 'name'), sep = ' \\[') %>%
  mutate(name = str_remove(name, '\\]')) %>%
  pivot_wider(names_from = name, values_from = scores)


# A tibble: 3 x 3
  v1    effort scope
  <chr>  <int> <int>
1 abc        1     4
2 def        2     5
3 ghi        3     6

Upvotes: 4

Related Questions