Reputation: 463
I would like to replace unique values with an index number using dplyr::mutate.
I am grouping by a couple of different variables to access the appropriate subset of my dataframe.
head(df)
group start_time end_time
1 group1 0 0.4
2 group1 0 0.4
3 group1 0 0.4
4 group1 0.4 0.8
5 group1 0.4 0.8
6 group2 0.0 0.4
7 group2 0.4 0.8
8 group2 0.8 1.02
I group_by 'group' and then by 'start_time'. A given group may have one, two, or three distinct start_times. I need to create a new variable, 'idx', that numbers each unique start_time within its group, but I can't work out how to do it.
new_df <- df %>%
  group_by(group, start_time) %>%
  mutate(idx = row_number()) %>%
  as.data.frame()
Creating the variable with row_number() isn't right, because it numbers the rows inside each (group, start_time) combination. It gives me:
idx
1
2
3
1
2
1
1
1
But I want:
idx
1
1
1
2
2
1
2
3
Is there a way to replace each unique start_time within a group with a number, and repeat that number on every row that shares the value?
Upvotes: 1
Views: 2394
Reputation: 28705
Another option is data.table::frank (short for "fast rank"):
df %>%
group_by(group) %>%
mutate(idx = data.table::frank(start_time, ties.method = 'dense'))
# # A tibble: 8 x 4
# # Groups: group [2]
# group start_time end_time idx
# <chr> <dbl> <dbl> <int>
# 1 group1 0 0.4 1
# 2 group1 0 0.4 1
# 3 group1 0 0.4 1
# 4 group1 0.4 0.8 2
# 5 group1 0.4 0.8 2
# 6 group2 0 0.4 1
# 7 group2 0.4 0.8 2
# 8 group2 0.8 1.02 3
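For readers who would rather not pull in data.table, the same dense ranking can be sketched with dplyr's own dense_rank(). This is a minimal sketch, not part of the original answer; the data frame is rebuilt from the values shown in the question, and the name new_df is reused only for illustration.

```r
library(dplyr)

# Data frame rebuilt from the question's head(df) output
df <- data.frame(
  group = c("group1", "group1", "group1", "group1",
            "group1", "group2", "group2", "group2"),
  start_time = c(0, 0, 0, 0.4, 0.4, 0, 0.4, 0.8),
  end_time = c(0.4, 0.4, 0.4, 0.8, 0.8, 0.4, 0.8, 1.02)
)

# dense_rank() gives tied values the same rank and leaves no gaps,
# so each distinct start_time within a group gets 1, 2, 3, ...
new_df <- df %>%
  group_by(group) %>%
  mutate(idx = dense_rank(start_time)) %>%
  ungroup()

new_df$idx
```

Like frank with ties.method = 'dense', this numbers the values in sorted order rather than in order of first appearance.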
Upvotes: 1
Reputation: 887731
We can use match after grouping by 'group':
library(tidyverse)
df %>%
group_by(group) %>%
mutate(idx = match(start_time, unique(start_time)))
# A tibble: 8 x 4
# Groups: group [2]
# group start_time end_time idx
# <chr> <dbl> <dbl> <int>
#1 group1 0 0.4 1
#2 group1 0 0.4 1
#3 group1 0 0.4 1
#4 group1 0.4 0.8 2
#5 group1 0.4 0.8 2
#6 group2 0 0.4 1
#7 group2 0.4 0.8 2
#8 group2 0.8 1.02 3
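Another option is group_indices, applied to each 'group' subset separately:
# Note: since dplyr 1.0.0, passing columns to group_indices() is deprecated;
# group_by(start_time) first, then call group_indices() on the result.
df %>%
  group_split(group) %>%
  map_df(~ .x %>%
    mutate(idx = group_indices(., start_time)))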
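NOTE: If 'idx' needs to be created across the whole data frame rather than within each 'group', remove the group_by step.
NOTE2: In the OP's example, both versions (with and without group_by) give the same output, because the start_time values of every group follow the same sequence starting at 0.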
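One difference between the approaches in this thread only shows up when the values are not already sorted. The base R sketch below uses a made-up vector x (not from the question) to illustrate it: match(x, unique(x)) numbers values in order of first appearance, while converting a factor to integer numbers them in sorted order.

```r
# Hypothetical unsorted values, chosen only to show the difference
x <- c(0.8, 0, 0.4, 0.8)

# unique(x) keeps first-appearance order: 0.8, 0, 0.4
by_appearance <- match(x, unique(x))

# factor levels are sorted: 0, 0.4, 0.8
by_value <- as.integer(factor(x))

by_appearance  # 1 2 3 1
by_value       # 3 1 2 3
```

On the question's data the two agree, since start_time is already increasing within each group.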
Upvotes: 6
Reputation: 12165
We can do this easily using R's factor type. A factor variable is stored as integers that refer to a table of levels, which holds the actual values. We can then use as.integer or as.numeric to convert the factor to a number. When you do that, the levels table is dropped and you're left with only the integer codes that referred to it. Normally this is undesired (you want the actual values, not the encoded ones), but here it is exactly what we want, since identical values are encoded with the same number:
df <- structure(list(group = c("group1", "group1", "group1", "group1",
"group1", "group2", "group2", "group2"), start_time = c(0, 0,
0, 0.4, 0.4, 0, 0.4, 0.8), end_time = c(0.4, 0.4, 0.4, 0.8, 0.8,
0.4, 0.8, 1.02)), class = "data.frame", row.names = c(NA, -8L
))
library(dplyr)

df %>%
  mutate(idx = as.integer(factor(start_time)))
group start_time end_time idx
1 group1 0.0 0.40 1
2 group1 0.0 0.40 1
3 group1 0.0 0.40 1
4 group1 0.4 0.80 2
5 group1 0.4 0.80 2
6 group2 0.0 0.40 1
7 group2 0.4 0.80 2
8 group2 0.8 1.02 3
As an added benefit, this works just as well in base R:
df$idx <- as.integer(factor(df$start_time))
df
group start_time end_time idx
1 group1 0.0 0.40 1
2 group1 0.0 0.40 1
3 group1 0.0 0.40 1
4 group1 0.4 0.80 2
5 group1 0.4 0.80 2
6 group2 0.0 0.40 1
7 group2 0.4 0.80 2
8 group2 0.8 1.02 3
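Note that the factor conversion above is applied to the whole column, so it matches the desired output only because the groups in the question share the same start_time values. If the index has to restart within each group, one possible base R sketch uses ave() to apply the conversion per group. The data frame df2 below is hypothetical, chosen so that the two results differ.

```r
# Hypothetical data: group2 does not contain start_time 0
df2 <- data.frame(
  group = c("group1", "group1", "group1", "group2", "group2", "group2"),
  start_time = c(0, 0, 0.4, 0.4, 0.8, 0.8)
)

# Whole-column encoding: codes are shared across groups
global_idx <- as.integer(factor(df2$start_time))

# Per-group encoding: ave() applies FUN to each group's start_time values.
# ave() returns a double vector here, hence the outer as.integer().
df2$idx <- as.integer(
  ave(df2$start_time, df2$group,
      FUN = function(x) as.integer(factor(x)))
)

global_idx  # 1 1 2 2 3 3
df2$idx     # 1 1 2 1 2 2
```

In dplyr, adding group_by(group) before the mutate step should have the same effect.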
Upvotes: 3