Reputation:

Prefix labels of a categorical column with numbers based on an order

Consider a data frame with two numeric columns (x, y) and a categorical column (category) containing a string:

d1 <- data.frame(x = c(0, 1, 2, 5, 6.5, 8), y = c(0, 2, 3, 5, 5.5, 5), category = "A")
d2 <- data.frame(x = c(0, 1, 2, 4, 6, 8), y = c(0, 3, 3.5, 4, 4, 5), category = "B")
d3 <- data.frame(x = c(0, 1.5, 3, 4, 6.5, 8), y = c(0, 2.5, 3, 4, 4.5, 5), category = "C")
df <- rbind(d1, d2, d3)

> df
     x   y category
1  0.0 0.0        A
2  1.0 2.0        A
3  2.0 3.0        A
4  5.0 5.0        A
5  6.5 5.5        A
6  8.0 5.0        A
7  0.0 0.0        B
8  1.0 3.0        B
9  2.0 3.5        B
10 4.0 4.0        B
11 6.0 4.0        B
12 8.0 5.0        B
13 0.0 0.0        C
14 1.5 2.5        C
15 3.0 3.0        C
16 4.0 4.0        C
17 6.5 4.5        C
18 8.0 5.0        C

Based on the order of the sum of the last 3 y values for each distinct category value, I would like to append a number plus a dot as a prefix to the values of the category column. For the above example, those sums in ascending order would be 13, 13.5 and 15.5 for B, C and A respectively. The number to be prefixed to the categorical column values should correspondingly represent this order in accordance with the category label, i.e. 1.B, 2.C, 3.A should be the renamed values in category for the former B, C and A respectively. My desired output is shown below:

     x   y category
1  0.0 0.0      3.A
2  1.0 2.0      3.A
3  2.0 3.0      3.A
4  5.0 5.0      3.A
5  6.5 5.5      3.A
6  8.0 5.0      3.A
7  0.0 0.0      1.B
8  1.0 3.0      1.B
9  2.0 3.5      1.B
10 4.0 4.0      1.B
11 6.0 4.0      1.B
12 8.0 5.0      1.B
13 0.0 0.0      2.C
14 1.5 2.5      2.C
15 3.0 3.0      2.C
16 4.0 4.0      2.C
17 6.5 4.5      2.C
18 8.0 5.0      2.C

I'm able to collect the ordered sum values with corresponding category values in a list, but am unable to think past this:

endsum <- list()
diff <- unique(df$category)
for(i in 1:length(diff)) {
  endsum[diff[i]] <- sum(tail(subset(df, category == diff[i])$y, 3))
}
endsum <- endsum[order(sapply(endsum, function(a) a))]

> endsum
$B
[1] 13

$C
[1] 13.5

$A
[1] 15.5

I did think of comparing names(endsum)[i] with df$category while traversing df row-wise and for matches, append the number maintained in the list order along with a dot, i.e. for e.g. when B from names(endsum)[1] gets matched with the data frame's category label B, or rows 7-12 of df, it will take its order in the list i which is 1, and prefix it (with a dot) to rows 7-12 of df, turning the B into 1.B.

However, am unsure of its implementation and this does not seem to be the best approach. Could anyone suggest a better way to achieve this? Help will be appreciated!

Upvotes: 3

Answers (3)

AnilGoyal

Reputation: 26218

A one liner base R solution. Since there is no ties.method = "dense" in rank() in r, I first converted it to factor and then as.integer to return the dense ranks.

df$category <- paste(as.integer(factor(rank(ave(df$y, df$category, FUN = function(x){n <- length(x); sum(x[(n-2):n])}), 
     ties.method = "min"))), df$category, sep = ".")

df
     x   y category
1  0.0 0.0      3.A
2  1.0 2.0      3.A
3  2.0 3.0      3.A
4  5.0 5.0      3.A
5  6.5 5.5      3.A
6  8.0 5.0      3.A
7  0.0 0.0      1.B
8  1.0 3.0      1.B
9  2.0 3.5      1.B
10 4.0 4.0      1.B
11 6.0 4.0      1.B
12 8.0 5.0      1.B
13 0.0 0.0      2.C
14 1.5 2.5      2.C
15 3.0 3.0      2.C
16 4.0 4.0      2.C
17 6.5 4.5      2.C
18 8.0 5.0      2.C

Upvotes: 1

lil_barnacle

Reputation: 168

Another soluntion using data.table:

library(data.table)
dt <- data.table(df)

# Create sum of last 3 rows and rank the sum
last3 <- dt[, .SD[c(.N-2, .N-1, .N)], by="category"]
order <- last3[ , .(endsum=sum(y)), by="category"][ , rank := rank(endsum)]

# Merge rank to dt and create prefix
dt_final <- merge(dt, order, by="category")
dt_final[ , category := paste0(rank, ".", category)]

Upvotes: 0

Ronak Shah

Reputation: 389012

For each group keep the last 3 rows and sum their y values, arrange the data based on the sum and paste the row number to the category column.

library(dplyr)

df %>%
  group_by(category) %>%
  slice_tail(n = 3) %>%
  summarise(y = sum(y)) %>%
  arrange(y) %>%
  mutate(new_category = paste(row_number(), category, sep = '.')) %>%
  select(-y) %>%
  left_join(df, by = 'category')

#  category new_category     x     y
#   <chr>    <chr>        <dbl> <dbl>
# 1 B        1.B            0     0  
# 2 B        1.B            1     3  
# 3 B        1.B            2     3.5
# 4 B        1.B            4     4  
# 5 B        1.B            6     4  
# 6 B        1.B            8     5  
# 7 C        2.C            0     0  
# 8 C        2.C            1.5   2.5
# 9 C        2.C            3     3  
#10 C        2.C            4     4  
#11 C        2.C            6.5   4.5
#12 C        2.C            8     5  
#13 A        3.A            0     0  
#14 A        3.A            1     2  
#15 A        3.A            2     3  
#16 A        3.A            5     5  
#17 A        3.A            6.5   5.5
#18 A        3.A            8     5

Probably this one is shorter with the same logic :

library(dplyr)

df %>%
  group_by(category) %>%
  mutate(sum_y = sum(tail(y, 3))) %>%
  ungroup %>%
  arrange(sum_y) %>%
  mutate(category = paste(match(category, unique(category)),category, sep = '.'))

In base R this can be written as :

df1 <- aggregate(y~category, df, function(x) sum(tail(x, 3)))
df1 <- df1[order(df1$y), -2, drop = FALSE]
df1$new_category <- paste(seq(nrow(df1)), df1$category, sep = '.')
merge(df1, df, by = 'category')

Upvotes: 0

Prefix labels of a categorical column with numbers based on an order

Answers (3)

Related Questions