C.Robin
C.Robin

Reputation: 1102

Accessing indexes of str_locate and str_locate_all

Say I have this dataframe:

df <- structure(list(gender_age = c("males_rating_all_ages", "males_rating_<18", 
"males_rating_18-29", "males_rating_30-44", "males_rating_45+", 
"males_count_all_ages", "males_count_<18", "males_count_18-29", 
"males_count_30-44", "males_count_45+", "females_rating_all_ages", 
"females_rating_<18", "females_rating_18-29", "females_rating_30-44", 
"females_rating_45+"), count = c("7.4", "8.0", "7.5", "7.2", 
"7.5", "4,197", "15", "1,276", "1,631", "921", "7.8", "8.7", 
"7.7", "7.8", "8.1")), row.names = c(NA, -15L), class = c("tbl_df", 
"tbl", "data.frame"))

and I want to extract the gender, age, and type (i.e. count or rating) of the gender_age column and put them in their own columns.

I have this code so far:

df %>% mutate(gender = str_sub(.$gender_age, 1, str_locate(.$gender_age, "_")[1,]-1)) %>% 
  mutate(age    = str_sub(.$gender_age, str_locate_all(.$gender_age, "_")[[1]][2,], str_length(.$gender_age)))

# A tibble: 15 x 4
   gender_age              count gender age        
   <chr>                   <chr> <chr>  <chr>      
 1 males_rating_all_ages   7.4   males  _all_ages  
 2 males_rating_<18        8.0   males  _<18       
 3 males_rating_18-29      7.5   males  _18-29     
 4 males_rating_30-44      7.2   males  _30-44     
 5 males_rating_45+        7.5   males  _45+       
 6 males_count_all_ages    4,197 males  all_ages   
 7 males_count_<18         15    males  <18        
 8 males_count_18-29       1,276 males  18-29      
 9 males_count_30-44       1,631 males  30-44      
10 males_count_45+         921   males  45+        
11 females_rating_all_ages 7.8   femal  ng_all_ages
12 females_rating_<18      8.7   femal  ng_<18     
13 females_rating_18-29    7.7   femal  ng_18-29   
14 females_rating_30-44    7.8   femal  ng_30-44   
15 females_rating_45+      8.1   femal  ng_45+ 
    
Warning messages:
1: Problem with `mutate()` column `gender`.
ℹ `gender = str_sub(...)`.
ℹ longer object length is not a multiple of shorter object length 
2: Problem with `mutate()` column `age`.
ℹ `age = str_sub(...)`.
ℹ longer object length is not a multiple of shorter object length 

but as you can see it indexes on the same fixed value for str_locate_all() for every row of the data. Obviously this isn't ideal as the number of characters before the second underscore _ varies.

for example:

> str_locate_all("males_rating_all_ages", "_")
[[1]]
     start end
[1,]     6   6
[2,]    13  13
[3,]    17  17

so I have to index first on [[1]] and then the specific row of the matrix (in my case [2,] to get just the one value which I can feed in to the str_sub() expression.

but if i run:

> str_locate_all("females_rating_all_ages", "_")
[[1]]
     start end
[1,]     8   8
[2,]    15  15
[3,]    19  19

we can see that when there are more characters in front of the underscores the matrix indicates this as such. However, for the new column I created in the mutate function it seems to have taken the first row's index for all subsequent rows.

Can anyone see what i'm doing wrong here? Or propose an alternative way to extract the three columns I want from gender_age (ideally using str_ functions)?

Upvotes: 2

Views: 565

Answers (1)

akrun
akrun

Reputation: 887591

Instead of using str_locate, it may be easier with extract that capture groups based on the regex pattern

library(dplyr)
library(stringr)
df %>%
    extract(gender_age, into = c("gender", "age"), 
         "^([^_]+)_[^_]+_(.*)", remove = FALSE)

-ouptut

# A tibble: 15 x 4
   gender_age              gender  age      count
   <chr>                   <chr>   <chr>    <chr>
 1 males_rating_all_ages   males   all_ages 7.4  
 2 males_rating_<18        males   <18      8.0  
 3 males_rating_18-29      males   18-29    7.5  
 4 males_rating_30-44      males   30-44    7.2  
 5 males_rating_45+        males   45+      7.5  
 6 males_count_all_ages    males   all_ages 4,197
 7 males_count_<18         males   <18      15   
 8 males_count_18-29       males   18-29    1,276
 9 males_count_30-44       males   30-44    1,631
10 males_count_45+         males   45+      921  
11 females_rating_all_ages females all_ages 7.8  
12 females_rating_<18      females <18      8.7  
13 females_rating_18-29    females 18-29    7.7  
14 females_rating_30-44    females 30-44    7.8  
15 females_rating_45+      females 45+      8.1  

The issue in the OP's code is selecting the first list element with [[ for str_locate_all. If the list is of length 1, it works, but, here the list length is the same as the number of rows of the data and thuse [[1]] would select the first row observation . This can be rectified either using rowwise before the mutate step

df %>%
   rowwise %>%
   mutate(gender = str_sub(gender_age, 1, str_locate(gender_age, "_")[1,1]-1)) %>% 
    mutate(age    = str_sub(gender_age, str_locate_all(gender_age, 
          "_")[[1]][2,1]+1, str_length(gender_age)))
# A tibble: 15 x 4
# Rowwise: 
   gender_age              count gender  age     
   <chr>                   <chr> <chr>   <chr>   
 1 males_rating_all_ages   7.4   males   all_ages
 2 males_rating_<18        8.0   males   <18     
 3 males_rating_18-29      7.5   males   18-29   
 4 males_rating_30-44      7.2   males   30-44   
 5 males_rating_45+        7.5   males   45+     
 6 males_count_all_ages    4,197 males   all_ages
 7 males_count_<18         15    males   <18     
 8 males_count_18-29       1,276 males   18-29   
 9 males_count_30-44       1,631 males   30-44   
10 males_count_45+         921   males   45+     
11 females_rating_all_ages 7.8   females all_ages
12 females_rating_<18      8.7   females <18     
13 females_rating_18-29    7.7   females 18-29   
14 females_rating_30-44    7.8   females 30-44   
15 females_rating_45+      8.1   females 45+     

and then remove the .$ (which selects the entire column) or another option is to loop over the list with map get the column of interest from the matrix output

Upvotes: 1

Related Questions