melbez
melbez

Reputation: 1000

Sorting multiple columns by first letter and by numbers in R

I have created a dataframe that looks like the following:

item  mean
a_b   5
a_c   2
a_a   4
b_d   7
b_f   3
b_e   1

I would like to sort it so that it is first sorted by whether or not it begins with "a_" or "b_", and then have it sorted by mean. The final dataframe should look like this:

item  mean
a_c   2
a_a   4
a_b   5
b_e   1
b_f   3
b_d   7

Note that the item column is not sorted perfectly alphabetically. It is only sorted by the first letter.

I have tried:

arrange(df, item, mean) 

The problem with this is that it does not only sort by the "a_" and "b_" categories, but by the entire item name.

I am open to separating the original dataframe into separate dataframes using filter and then sorting the mean within these smaller subsets. I do not need everything to stay in the same dataframe. However, I am unsure how to use filter to only select rows that have items beginning with "a_" or "b_".

Upvotes: 1

Views: 2424

Answers (3)

AntoniosK
AntoniosK

Reputation: 16121

The philosophy is that if you want to arrange by something (i.e. a substring here) you have to obtain it first:

df = read.table(text = "
item  mean
a_b   5
a_c   2
a_a   4
b_d   7
b_f   3
b_e   1
", header=T, stringsAsFactors=F)

library(tidyverse)

df %>%
  separate(item, c("item1","item2"), remove = F) %>% # split items while keeping the original column
  arrange(item1, mean) %>%                           # arrange by what you really want
  select(item, mean)                                 # keep only relevant columns

#   item mean
# 1  a_c    2
# 2  a_a    4
# 3  a_b    5
# 4  b_e    1
# 5  b_f    3
# 6  b_d    7

Note that there are various ways to pick the 1st letter from a string. I just decided to use separate here.

In case you have many items separated by _ you'll still need to extract the first item, so you can replace the first _ with another delimiter (let's say :) and separate your column on that:

df = read.table(text = "
item  mean
a_b_m   5
a_c   2
a_a   4
b_d_x_q   7
b_f   3
b_e   1
", header=T, stringsAsFactors=F)

library(tidyverse)
library(stringr)

df %>%
  mutate(item2 = str_replace(item, "_", ":")) %>%
  separate(item2, c("item1","item2"), remove = F, sep = ":") %>% 
  arrange(item1, mean) %>%                           
  select(item, mean) 

#      item mean
# 1     a_c    2
# 2     a_a    4
# 3   a_b_m    5
# 4     b_e    1
# 5     b_f    3
# 6 b_d_x_q    7

Upvotes: 1

acylam
acylam

Reputation: 18661

Another method using dplyr:

library(dplyr)
arrange(df, sub('_.+$', '', item), mean)

an alternative would be to use str_extract from stringr to extract only the first letter from item:

library(stringr)
arrange(df, str_extract(item, '^._'), mean)

Result:

  item mean
1  a_c    2
2  a_a    4
3  a_b    5
4  b_e    1
5  b_f    3
6  b_d    7

Data:

df <- structure(list(item = c("a_b", "a_c", "a_a", "b_d", "b_f", "b_e"
), mean = c(5L, 2L, 4L, 7L, 3L, 1L)), .Names = c("item", "mean"
), class = "data.frame", row.names = c(NA, -6L))

Notes:

  • sub('_.+$', '', item) creates a temporary variable by removing _ and everything after that from item. _.+$ matches a literal underscore (_) followed by any character one or more times (.+) at the end of the string ($).

  • str_extract(item, '^._') creates a temporary variable by extracting any one character (.) followed by a literal underscore (_) in the beginning of the string (^)

  • The neat thing about dplyr::arrange is that you can create a temporary sorting variable within the function and not have it included in the output.

Upvotes: 3

Rui Barradas
Rui Barradas

Reputation: 76402

A base R solution would be

inx <- order(substr(df$item, 1, 1), df$mean)
newdf <- df[inx, ]

newdf
#  item mean
#2  a_c    2
#3  a_a    4
#1  a_b    5
#6  b_e    1
#5  b_f    3
#4  b_d    7

Upvotes: 0

Related Questions