Sort data frame based on the prefix of the values within the columns

Question

I'm an engineering student in France and I'm doing a project in R for university. I'm trying to do something specific with my data frame.

My data frame looks like this:

      id         grades     std_id               UID
     1004          1         1004        cm1_AZZ_005_LKJ_xxx
     1004          1         1004        cm1_AZZ_002_LKJ_xxx
     1004          0         1004        cm1_AZZ_005_LKJ_xxx
     1004          1         1004        cm1_AZZ_002_LKJ_xxx
     1004          0         1004        cm1_AZZ_002_LKJ_xxx
     1004          1         1004        cm1_AZZ_009_LKJ_xxx
     1004          1         1004        cm1_AZZ_002_LKJ_xxx
     7687          1         0897        cm1_XYZ_457_HGF_xxx
     7687          1         0897        cm1_XYZ_970_HGF_xxx
     7687          1         0897        cm1_XBZ_674_KGH_xxx
     7687          0         0897        cm1_XBZ_987_KGH_xxx
     7687          1         0897        cm1_XBZ_780_KGH_xxx
     ....        .....       ....               .....

I would like to sort (that is, group) the rows of my data frame by the values in the UID column.

My real data frame is larger, and further down it contains other UID values.

At the moment I pick out the row range for each UID by hand, which is clearly inefficient:

list_002 <- new_items[1:7] 
list_003 <- new_items[8:9]
list_005 <- new_items[10:12]

As you can see, I would like to group the rows by the prefix of the UID only, not by the full string.

Prefixes: cm1_AZZ, cm1_XYZ, cm1_XBZ

In my data the UID prefix is always one of cm1_AZZ, cm1_XYZ or cm1_XBZ, but the suffix can change.

I would like to split the data frame into 3 lists based on the UID prefix (cm1_AZZ, cm1_XYZ, cm1_XBZ), so that I get 3 lists in total rather than one list per distinct UID.

Like this:

list_AZZ <- list()
list_XYZ <- list()
list_XBZ <- list()

list_AZZ <- cm1_AZZ_005      list_XYZ <- cm1_XYZ_457 
            cm1_AZZ_002                  cm1_XYZ_970
            cm1_AZZ_005
            cm1_AZZ_002
            cm1_AZZ_002
            cm1_AZZ_009
            cm1_AZZ_002

list_XBZ <- cm1_XBZ_674
            cm1_XBZ_987
            cm1_XBZ_780

Thanks for your help. Sorry for my poor English.

talat · Accepted Answer

Using split and sub you could do:

# original answer (before question update):
# new_list <- split(df, sub("(cm1_\\d{3}).*", "\\1", df$UID))
# updated answer (backslashes must be doubled inside R strings):
new_list <- split(df, sub("(cm1_[^_]+).*", "\\1", df$UID))

This will return a list in which each UID group (one per prefix, ignoring the suffix) is a data.frame.
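To see what the grouping key looks like before it is passed to split, here is a minimal sketch of the sub step on its own. The vector uid is a hypothetical sample built from three UIDs in the question's table, not part of the original answer.

```r
# Hypothetical sample: three UIDs copied from the question's table
uid <- c("cm1_AZZ_005_LKJ_xxx", "cm1_XYZ_457_HGF_xxx", "cm1_XBZ_674_KGH_xxx")

# Capture "cm1_" plus everything up to the next underscore,
# then replace the whole string with that captured group
key <- sub("(cm1_[^_]+).*", "\\1", uid)
print(key)
```

Each element of key is the prefix only, so rows sharing a prefix end up in the same group.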

You can then access the elements for example using

new_list$cm1_AZZ

or

new_list[[2]]
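As a self-contained check, the sketch below rebuilds the twelve rows shown in the question and applies the same split. The column types (for example, std_id stored as character to keep the leading zero) and the helper name uid_by_prefix are assumptions made for illustration.

```r
# Rebuild the sample rows from the question (column types are assumed)
df <- data.frame(
  id     = c(rep(1004, 7), rep(7687, 5)),
  grades = c(1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1),
  std_id = c(rep("1004", 7), rep("0897", 5)),
  UID    = c("cm1_AZZ_005_LKJ_xxx", "cm1_AZZ_002_LKJ_xxx",
             "cm1_AZZ_005_LKJ_xxx", "cm1_AZZ_002_LKJ_xxx",
             "cm1_AZZ_002_LKJ_xxx", "cm1_AZZ_009_LKJ_xxx",
             "cm1_AZZ_002_LKJ_xxx", "cm1_XYZ_457_HGF_xxx",
             "cm1_XYZ_970_HGF_xxx", "cm1_XBZ_674_KGH_xxx",
             "cm1_XBZ_987_KGH_xxx", "cm1_XBZ_780_KGH_xxx"),
  stringsAsFactors = FALSE
)

# One data.frame per UID prefix
new_list <- split(df, sub("(cm1_[^_]+).*", "\\1", df$UID))

# Group names come out in sorted order, not in order of appearance
print(names(new_list))
print(sapply(new_list, nrow))

# Hypothetical helper: keep only the UID vector of each group
uid_by_prefix <- lapply(new_list, function(d) d$UID)
print(uid_by_prefix$cm1_XYZ)
```

Because split orders the groups by sorted name, new_list[[2]] is the cm1_XBZ group here, so accessing by name (new_list$cm1_XBZ) is less error-prone than accessing by position.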
