Reputation: 1102
Say I have this dataframe:
df <- structure(list(gender_age = c("males_rating_all_ages", "males_rating_<18",
"males_rating_18-29", "males_rating_30-44", "males_rating_45+",
"males_count_all_ages", "males_count_<18", "males_count_18-29",
"males_count_30-44", "males_count_45+", "females_rating_all_ages",
"females_rating_<18", "females_rating_18-29", "females_rating_30-44",
"females_rating_45+"), count = c("7.4", "8.0", "7.5", "7.2",
"7.5", "4,197", "15", "1,276", "1,631", "921", "7.8", "8.7",
"7.7", "7.8", "8.1")), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
and I want to extract the gender, age, and type (i.e. count
or rating
) of the gender_age
column and put them in their own columns.
I have this code so far:
df %>% mutate(gender = str_sub(.$gender_age, 1, str_locate(.$gender_age, "_")[1,]-1)) %>%
mutate(age = str_sub(.$gender_age, str_locate_all(.$gender_age, "_")[[1]][2,], str_length(.$gender_age)))
# A tibble: 15 x 4
gender_age count gender age
<chr> <chr> <chr> <chr>
1 males_rating_all_ages 7.4 males _all_ages
2 males_rating_<18 8.0 males _<18
3 males_rating_18-29 7.5 males _18-29
4 males_rating_30-44 7.2 males _30-44
5 males_rating_45+ 7.5 males _45+
6 males_count_all_ages 4,197 males all_ages
7 males_count_<18 15 males <18
8 males_count_18-29 1,276 males 18-29
9 males_count_30-44 1,631 males 30-44
10 males_count_45+ 921 males 45+
11 females_rating_all_ages 7.8 femal ng_all_ages
12 females_rating_<18 8.7 femal ng_<18
13 females_rating_18-29 7.7 femal ng_18-29
14 females_rating_30-44 7.8 femal ng_30-44
15 females_rating_45+ 8.1 femal ng_45+
Warning messages:
1: Problem with `mutate()` column `gender`.
ℹ `gender = str_sub(...)`.
ℹ longer object length is not a multiple of shorter object length
2: Problem with `mutate()` column `age`.
ℹ `age = str_sub(...)`.
ℹ longer object length is not a multiple of shorter object length
but as you can see it indexes on the same fixed value for str_locate_all()
for every row of the data. Obviously this isn't ideal as the number of characters before the second underscore _
varies.
for example:
> str_locate_all("males_rating_all_ages", "_")
[[1]]
start end
[1,] 6 6
[2,] 13 13
[3,] 17 17
so I have to index first on [[1]]
and then the specific row of the matrix (in my case [2,]
to get just the one value which I can feed in to the str_sub()
expression.
but if i run:
> str_locate_all("females_rating_all_ages", "_")
[[1]]
start end
[1,] 8 8
[2,] 15 15
[3,] 19 19
we can see that when there are more characters in front of the underscores the matrix indicates this as such. However, for the new column I created in the mutate
function it seems to have taken the first row's index for all subsequent rows.
Can anyone see what i'm doing wrong here? Or propose an alternative way to extract the three columns I want from gender_age
(ideally using str_ functions)?
Upvotes: 2
Views: 565
Reputation: 887591
Instead of using str_locate
, it may be easier with extract
that capture groups based on the regex pattern
library(dplyr)
library(stringr)
df %>%
extract(gender_age, into = c("gender", "age"),
"^([^_]+)_[^_]+_(.*)", remove = FALSE)
-ouptut
# A tibble: 15 x 4
gender_age gender age count
<chr> <chr> <chr> <chr>
1 males_rating_all_ages males all_ages 7.4
2 males_rating_<18 males <18 8.0
3 males_rating_18-29 males 18-29 7.5
4 males_rating_30-44 males 30-44 7.2
5 males_rating_45+ males 45+ 7.5
6 males_count_all_ages males all_ages 4,197
7 males_count_<18 males <18 15
8 males_count_18-29 males 18-29 1,276
9 males_count_30-44 males 30-44 1,631
10 males_count_45+ males 45+ 921
11 females_rating_all_ages females all_ages 7.8
12 females_rating_<18 females <18 8.7
13 females_rating_18-29 females 18-29 7.7
14 females_rating_30-44 females 30-44 7.8
15 females_rating_45+ females 45+ 8.1
The issue in the OP's code is selecting the first list
element with [[
for str_locate_all
. If the list
is of length
1, it works, but, here the list
length is the same as the number of rows of the data and thuse [[1]]
would select the first row observation . This can be rectified either using rowwise
before the mutate
step
df %>%
rowwise %>%
mutate(gender = str_sub(gender_age, 1, str_locate(gender_age, "_")[1,1]-1)) %>%
mutate(age = str_sub(gender_age, str_locate_all(gender_age,
"_")[[1]][2,1]+1, str_length(gender_age)))
# A tibble: 15 x 4
# Rowwise:
gender_age count gender age
<chr> <chr> <chr> <chr>
1 males_rating_all_ages 7.4 males all_ages
2 males_rating_<18 8.0 males <18
3 males_rating_18-29 7.5 males 18-29
4 males_rating_30-44 7.2 males 30-44
5 males_rating_45+ 7.5 males 45+
6 males_count_all_ages 4,197 males all_ages
7 males_count_<18 15 males <18
8 males_count_18-29 1,276 males 18-29
9 males_count_30-44 1,631 males 30-44
10 males_count_45+ 921 males 45+
11 females_rating_all_ages 7.8 females all_ages
12 females_rating_<18 8.7 females <18
13 females_rating_18-29 7.7 females 18-29
14 females_rating_30-44 7.8 females 30-44
15 females_rating_45+ 8.1 females 45+
and then remove the .$
(which selects the entire column) or another option is to loop over the list
with map
get the column of interest from the matrix
output
Upvotes: 1