GaB

Reputation: 1134

Why is my table not sorted in alphabetical order in R (tidyverse only)?

I am trying to arrange the 'Smoking status' categories in alphabetical order. This should be done with tidyverse only.

This is what I have tried

smoking_gender_disch_piv_count_ren <- smoking_gender_disch_piv_count %>%
  dplyr::rename('Smoking Status' = smoking_status) %>%
  dplyr::arrange('Smoking status')
smoking_gender_disch_piv_count_ren

As one can see, I do not get Current smoker first, then Ex smoker, and so on. I thought the arrange function in dplyr would do the trick, but it does not.

This is the data I have:

structure(list(smoking_status = structure(1:5, .Label = c("Ex smoker", 
"Current smoker", "Never smoked", "Unknown", "Non smoker - smoking history unknown"
), class = "factor"), Female = c(24.0601503759398, 9.02255639097744, 
35.3383458646617, 6.01503759398496, 25.5639097744361), Male = c(34.9753694581281, 
13.7931034482759, 23.6453201970443, 1.97044334975369, 25.615763546798
), NSTEMI = c(31.9078947368421, 12.5, 28.2894736842105, 3.28947368421053, 
24.0131578947368), STEMI = c(18.75, 6.25, 28.125, 6.25, 40.625
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

Upvotes: 0

Views: 711

Answers (1)

Greg

Reputation: 3326

Aside from misspelling 'Smoking Status' as 'Smoking status', you ran into two other problems.

Variable Names vs. Strings

We use single (') or double quotes (") to designate strings: 'my string' or "my string". However, to designate (unusual) variable names (symbols) with spaces in them, we use backticks (`): `my variable`. Since it's a pain to type those backticks, we typically use underscores (_) rather than spaces in variable names.

When (re)naming columns, character strings are as good as symbols. That is

  # ... %>%
  dplyr::rename('Smoking Status' = smoking_status) # %>% ...
  #             |--------------|
  #             character string

is equivalent to

  # ... %>%
  dplyr::rename(`Smoking Status` = smoking_status) # %>% ...
  #             |--------------|
  #                  symbol

However, when performing vectorized operations with mutate() or filter() or arrange(), any string will be treated as simply a scalar character value. That is

  # ... %>%
  mutate(test = 'Smoking Status') # %>% ...
  #             |--------------|
  #             character string

will not copy the `Smoking Status` column (a factor)

# A tibble: 5 x 6
  ... test                                
  ... <fct>                               
1 ... Ex smoker                           
2 ... Current smoker                      
3 ... Never smoked                        
4 ... Unknown                             
5 ... Non smoker - smoking history unknown

but rather give you a (character) column filled with the literal string 'Smoking Status':

# A tibble: 5 x 6
  ... test          
  ... <chr>         
1 ... Smoking Status
2 ... Smoking Status
3 ... Smoking Status
4 ... Smoking Status
5 ... Smoking Status

Similarly, your

  # ... %>%
  dplyr::arrange('Smoking Status')
  #                       |----|
  #      Corrected typo: 'status'.

does not sort on the `Smoking Status` column, but rather on a (temporary) column filled with the string 'Smoking Status'. Since every value in that column is the same, no rearranging occurs at all, and the row order of the smoking_gender_disch_piv_count dataset remains unchanged.
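To see this no-op for yourself, here is a minimal sketch on a small made-up tibble (the `toy` data is hypothetical, not your dataset), assuming dplyr is installed. It compares sorting on the quoted string with sorting on the backticked symbol.

```r
library(dplyr)

# Hypothetical toy data, NOT the asker's dataset
toy <- tibble::tibble(
  `Smoking Status` = c("Unknown", "Current smoker", "Ex smoker"),
  Female = c(6.0, 9.0, 24.1)
)

# Quoted: sorts on a constant string, so the row order is untouched
by_string <- toy %>% arrange('Smoking Status')

# Backticked: sorts on the actual column
by_symbol <- toy %>% arrange(`Smoking Status`)

# TRUE when the quoted version left the rows where they were
identical(by_string$`Smoking Status`, toy$`Smoking Status`)

# Column values after sorting on the real column
by_symbol$`Smoking Status`
```

If the first check returns TRUE while `by_symbol` comes back reordered, the quotes were the culprit.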

Fix

To fix this particular issue, use:

  # ... %>%
  dplyr::arrange(`Smoking Status`)

Strings vs. Factors

Even after fixing the issue above, you'll still have a problem. Your Smoking Status column is a factor

[1] Ex smoker                            Current smoker                       Never smoked                         Unknown                              Non smoker - smoking history unknown
Levels: Ex smoker Current smoker Never smoked Unknown Non smoker - smoking history unknown

so when you sort on this column, it follows the ordering of the factor levels, which are visibly not in alphabetical order.

Fix

To sort by alphabetical order, use the character form of the `Smoking Status` column:

  # ... %>%
  dplyr::arrange(as.character(`Smoking Status`))
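The difference between the two sort orders can be sketched on a small made-up factor (the `toy` tibble and its `status` and `n` columns are hypothetical), assuming dplyr is installed. The levels are deliberately declared in non-alphabetical order, as in your data.

```r
library(dplyr)

# Hypothetical toy factor whose level order is deliberately not alphabetical
status <- factor(
  c("Ex smoker", "Current smoker", "Never smoked"),
  levels = c("Ex smoker", "Current smoker", "Never smoked")
)
toy <- tibble::tibble(status = status, n = 1:3)

# Sorting on the factor follows the order of its levels
by_levels <- toy %>% arrange(status)

# Sorting on the character form follows alphabetical order
by_alpha <- toy %>% arrange(as.character(status))

as.character(by_levels$status)  # "Ex smoker" "Current smoker" "Never smoked"
as.character(by_alpha$status)   # "Current smoker" "Ex smoker" "Never smoked"
```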

Solution

Given the smoking_gender_disch_piv_count dataset you reproduced

smoking_gender_disch_piv_count <-
  structure(list(smoking_status = structure(1:5, .Label = c("Ex smoker", "Current smoker", "Never smoked", "Unknown", "Non smoker - smoking history unknown"), class = "factor"),
                 Female = c(24.0601503759398, 9.02255639097744, 35.3383458646617, 6.01503759398496, 25.5639097744361),
                 Male = c(34.9753694581281, 13.7931034482759, 23.6453201970443, 1.97044334975369, 25.615763546798),
                 NSTEMI = c(31.9078947368421, 12.5, 28.2894736842105, 3.28947368421053, 24.0131578947368),
                 STEMI = c(18.75, 6.25, 28.125, 6.25, 40.625)),
            row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))

the following dplyr workflow

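library(dplyr)  # provides %>%, rename() and arrange()

smoking_gender_disch_piv_count_ren <- smoking_gender_disch_piv_count %>%
  # Backticks: a symbol, so the column can be referred to later
  dplyr::rename(`Smoking Status` = smoking_status) %>%
  # Sort on the character form to get alphabetical order
  dplyr::arrange(as.character(`Smoking Status`))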

will give you your desired results for smoking_gender_disch_piv_count_ren

# A tibble: 5 x 5
  `Smoking Status`                     Female  Male NSTEMI STEMI
  <fct>                                 <dbl> <dbl>  <dbl> <dbl>
1 Current smoker                         9.02 13.8   12.5   6.25
2 Ex smoker                             24.1  35.0   31.9  18.8 
3 Never smoked                          35.3  23.6   28.3  28.1 
4 Non smoker - smoking history unknown  25.6  25.6   24.0  40.6 
5 Unknown                                6.02  1.97   3.29  6.25

while still preserving the factor information in `Smoking Status`.
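If you want to confirm that the factor survives the sort, a quick check along these lines should do. It is a sketch on a hypothetical three-row `toy` tibble (not your dataset), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data with non-alphabetical factor levels
toy <- tibble::tibble(
  smoking_status = factor(
    c("Ex smoker", "Current smoker", "Unknown"),
    levels = c("Ex smoker", "Current smoker", "Unknown")
  ),
  Female = c(24.1, 9.0, 6.0)
)

sorted <- toy %>%
  rename(`Smoking Status` = smoking_status) %>%
  arrange(as.character(`Smoking Status`))

# The column is still a factor ...
is.factor(sorted$`Smoking Status`)

# ... and its levels keep their original (non-alphabetical) order
levels(sorted$`Smoking Status`)
```

Only the rows move. The level order itself is untouched, so anything downstream that relies on the levels will still see the original ordering.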

Upvotes: 2
