Rosanne

Reputation: 660

Order list of strings by last number in string in R

I have the following list:

datalist <- c("20191107_1545_28.xlsx", "20191108_1520_95.xlsx", "20191108_1104_99.xlsx", "20200127_1505_28.xlsx", "20200124_1505_41B.xlsx", "20200122_1505_1.xlsx", "20191102_1520_102.xlsx")

which I want to order by the last number, and then by the first number (the date), so that it looks like:

"20200122_1505_1.xlsx" "20191107_1545_28.xlsx" "20200127_1505_28.xlsx" "20200124_1505_41B.xlsx" "20191108_1520_95.xlsx" "20191108_1104_99.xlsx" "20191102_1520_102.xlsx"

I have been playing around with StrReverse, so I could then just order it normally, but unfortunately, it of course also reverses the number. I tried to split the string first:

split=str_split(datalist, "_")

but I don't know how to continue. The number that I want to order by could have 1, 2 or 3 digits and could also contain a B (as in the example). Does anyone know how to fix this? Thanks in advance!

Upvotes: 1

Views: 256

Answers (2)

tmfmnk

Reputation: 39858

One stringr option could be:

library(stringr)
datalist[str_order(str_extract_all(datalist, "\\d+", simplify = TRUE)[, 3], numeric = TRUE)]

[1] "20200122_1505_1.xlsx"   "20191107_1545_28.xlsx"  "20200127_1505_28.xlsx" 
[4] "20200124_1505_41B.xlsx" "20191108_1520_95.xlsx"  "20191108_1104_99.xlsx" 
[7] "20191102_1520_102.xlsx"

Or a more flexible option:

datalist[str_order(sapply(str_extract_all(datalist, "\\d+"), tail, 1), numeric = TRUE)]

If you indeed want to order according to multiple numbers, with the addition of dplyr:

library(dplyr)
bind_cols(datalist = datalist, 
          as.data.frame(str_extract_all(datalist, "\\d+", simplify = TRUE))) %>%
  mutate_at(vars(starts_with("V")), ~ as.numeric(as.character(.))) %>%
  arrange(V3, V1)

  datalist                     V1    V2    V3
  <chr>                     <dbl> <dbl> <dbl>
1 20200122_1505_1.xlsx   20200122  1505     1
2 20191107_1545_28.xlsx  20191107  1545    28
3 20200127_1505_28.xlsx  20200127  1505    28
4 20200124_1505_41B.xlsx 20200124  1505    41
5 20191108_1520_95.xlsx  20191108  1520    95
6 20191108_1104_99.xlsx  20191108  1104    99
7 20191102_1520_102.xlsx 20191102  1520   102

Upvotes: 0

Adverse_Event

Reputation: 113

I think this does the trick. Note that it sorts only by the numeric part and ignores letters: a letter attached to the end of the last number (as in 41B) is dropped, since that's how the data looks, but the regular expression can be modified to fit other needs.

library(data.table)
datalist <- c("20191107_1545_28.xlsx","20191108_1520_95.xlsx","20191108_1104_99.xlsx","20200127_1505_28.xlsx", "20200124_1505_41B.xlsx", "20200122_1505_1.xlsx", "20191102_1520_102.xlsx")


dt <- data.table('datalist' = datalist)
dt[, 'num1' := as.numeric(gsub(pattern = '(\\d{1,10})(_)(\\d{1,10})(_)(\\d{1,10})(.*)', x = datalist, replacement = '\\1'))]
dt[, 'num2' := as.numeric(gsub(pattern = '(\\d{1,10})(_)(\\d{1,10})(_)(\\d{1,10})(.*)', x = datalist, replacement = '\\3'))]
dt[, 'num3' := as.numeric(gsub(pattern = '(\\d{1,10})(_)(\\d{1,10})(_)(\\d{1,10})(.*)', x = datalist, replacement = '\\5'))]

setkey(dt, num3, num1)
print(dt$datalist)

Edit: forgot to coerce to numeric. Fixed.
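
For comparison, the same two-key ordering (last number first, then date) can probably be done in base R without any packages. This is only a minimal sketch, assuming every file name follows the date_time_number pattern from the question; the names parts, date_num, last_num and sorted are illustrative and not from the original code.

```r
datalist <- c("20191107_1545_28.xlsx", "20191108_1520_95.xlsx",
              "20191108_1104_99.xlsx", "20200127_1505_28.xlsx",
              "20200124_1505_41B.xlsx", "20200122_1505_1.xlsx",
              "20191102_1520_102.xlsx")

# Split each name on "_" into three pieces: date, time, "number(+letter).xlsx"
parts <- strsplit(datalist, "_", fixed = TRUE)

# First piece is the date; convert it to a number for the tie-break
date_num <- as.numeric(vapply(parts, `[`, character(1), 1))

# Third piece looks like "41B.xlsx"; keep only its leading digits
# (assumption: the letter suffix is ignored, as in the answer above)
last_piece <- vapply(parts, `[`, character(1), 3)
last_num <- as.numeric(sub("^([0-9]+).*$", "\\1", last_piece))

# order() sorts by the first key and breaks ties with the second
sorted <- datalist[order(last_num, date_num)]
print(sorted)
```

With the sample data this should give the order asked for in the question, with the two 28 files arranged by date.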

Upvotes: 1
