Mina
Mina

Reputation: 111

Group similar strings with numbers and keep order of first appearance

I have a dataframe which looks like this example (just much larger):

var <- c('Peter','Ben','Mary','Peter.1','Ben.1','Mary.1','Peter.2','Ben.2','Mary.2')
v1 <- c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7)
v2 <- c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
df <- data.frame(var, v1, v2)

      var  v1  v2
1   Peter 0.4 0.5
2     Ben 0.6 0.4
3    Mary 0.7 0.2
4 Peter.1 0.3 0.5
5   Ben.1 0.9 0.4
6  Mary.1 0.2 0.2
7 Peter.2 0.4 0.1
8   Ben.2 0.6 0.4
9  Mary.2 0.7 0.2

I want to group the strings in 'var' according to the names without the suffixes, and keep the original order of first appearance. Desired output:

      var  v1  v2
1 Peter   0.4 0.5 # Peter appears first in the original data
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4 # Ben appears second in the original data
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2 # Mary appears third in the original data
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

How can I achieve that?

Thank you!

Upvotes: 2

Views: 199

Answers (4)

G. Grothendieck
G. Grothendieck

Reputation: 269644

separate the var column into two columns, replace the NAs that get generated with 0, sort and remove the extra columns.

This works on the numeric value of the numbers rather than the character representation so that for example, 10 won't come before 2. Also, the match in arrange ensures that the order is based on the first occurrence order.

df %>%
  separate(var, c("alpha", "no"), convert=TRUE, remove=FALSE, fill="right") %>% 
  mutate(no = replace_na(no, 0)) %>%
  arrange(match(alpha, alpha), no) %>%
  select(-alpha, -no)

giving

      var  v1  v2
1   Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

Update

Have removed what was previously the first solution after reading the update to the question.

Upvotes: 1

Henrik
Henrik

Reputation: 67778

Compact option with sub and data.table::chgroup

df[chgroup(sub("\\..", "", df$var)),]

      var  v1  v2
1   Peter 0.4 0.5
4 Peter.1 0.3 0.5
7 Peter.2 0.4 0.1
2     Ben 0.6 0.4
5   Ben.1 0.9 0.4
8   Ben.2 0.6 0.4
3    Mary 0.7 0.2
6  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

chgroup groups together duplicated values but retains the group order (according the first appearance order of each group), efficiently

Upvotes: 2

Chris Ruehlemann
Chris Ruehlemann

Reputation: 21400

If you don't mind that the values in var are ordered alphabetically, then the simplest solution is this:

df %>%
  arrange(var)
      var  v1  v2
1     Ben 0.6 0.4
2   Ben.1 0.9 0.4
3   Ben.2 0.6 0.4
4    Mary 0.7 0.2
5  Mary.1 0.2 0.2
6  Mary.2 0.7 0.2
7   Peter 0.4 0.5
8 Peter.1 0.3 0.5
9 Peter.2 0.4 0.1

Upvotes: 1

akrun
akrun

Reputation: 887128

An option is to create a temporary column without the . and the digits (\\d+) at the end with str_remove, then use factor with levels specified as the unique values or use match to arrange the data

library(dplyr)
library(stringr)
df <- df %>%
   mutate(var1 = str_remove(var, "\\.\\d+$")) %>% 
   arrange(factor(var1, levels = unique(var1))) %>%
   select(-var1)

Or use fct_inorder from forcats which will convert to factor with levels in the order of first appearance

library(forcats)
df %>% 
   arrange(fct_inorder(str_remove(var, "\\.\\d+$")))

-output

     var  v1  v2
1   Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4     Ben 0.6 0.4
5   Ben.1 0.9 0.4
6   Ben.2 0.6 0.4
7    Mary 0.7 0.2
8  Mary.1 0.2 0.2
9  Mary.2 0.7 0.2

Upvotes: 1

Related Questions