Chris Wiese
Chris Wiese

Reputation: 87

Keeping column labels when using the select function with dplyr

I'm combining several datasets and keeping only specific columns, however, when I use the select function from dplyr, it does not retain the original column labels (in my case, the item text).

After I import the data, I use the following code to select the variables I'd like to keep:

Wave1Data_Clean <- select(Wave1Data, ID, x1, x2, x3, x4, x5)

When I do this, the column names are correct as are the values, however, the column labels are replaced with the labels of the original dataframe. For example, if column #1's label was 'location' in the original data, the column name for the ID in the df would be 'location' instead of the label associated with ID.

Is there a way to retain the appropriate column labels?

Upvotes: 3

Views: 1643

Answers (2)

stragu
stragu

Reputation: 1291

In the current version of dplyr (1.0.5), the attribute "variable.labels" is lost entirely when using dplyr::select(). It seems that a better way to store column labels is to have attributes associated to the vectors themselves rather than the whole dataframe, which is what functions from the haven package do. Using haven::read_sav() as an alternative will keep the labels when transforming the data.

Here is a reproducible comparison:

library(foreign)
# import example SAV from foreign
sav <- system.file("files", "electric.sav", package = "foreign")
dat_for <- read.spss(file = sav, to.data.frame = TRUE)

# labels are saved as a variable.labels attribute for the whole dataframe
str(dat_for)
#> 'data.frame':    240 obs. of  13 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...

# dplyr::select() loses the variable.labels attribute:
library(dplyr, warn.conflicts = FALSE)
dat_for %>% select(1:2) %>% str()
#> 'data.frame':    240 obs. of  2 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...

# but to be fair, base slicing also loses the attribute:
dat_for[,1:2] %>% str()
#> 'data.frame':    240 obs. of  2 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...

# import same file with haven
library(haven)
dat_hav <- read_sav(file = sav)

# labels are attributes of columns
str(dat_hav$CASEID)
#>  num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#>  - attr(*, "label")= chr "CASE IDENTIFICATION NUMBER"
#>  - attr(*, "format.spss")= chr "F4.0"
#>  - attr(*, "display_width")= int 0

# dplyr::select() keeps them
dat_hav %>% select(1:2) %>% str()
#> tibble[,2] [240 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ CASEID  : num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#>   ..- attr(*, "label")= chr "CASE IDENTIFICATION NUMBER"
#>   ..- attr(*, "format.spss")= chr "F4.0"
#>   ..- attr(*, "display_width")= int 0
#>  $ FIRSTCHD: dbl+lbl [1:240] 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 6, 2, 3, 5, 3, 3, 3, 3, ...
#>    ..@ label        : chr "FIRST CHD EVENT"
#>    ..@ format.spss  : chr "F1.0"
#>    ..@ display_width: int 0
#>    ..@ labels       : Named num [1:5] 1 2 3 5 6
#>    .. ..- attr(*, "names")= chr [1:5] "NO CHD" "SUDDEN  DEATH" "NONFATALMI" "FATAL   MI" ...
#>  - attr(*, "label")= chr "                       SPSS/PC+"

Created on 2021-03-30 by the reprex package (v1.0.0)

Upvotes: 0

Kelly
Kelly

Reputation: 43

I can see the problem you describe when I view the data frame in RStudio.

Example_Clean in RStudio

I'm not sure why the labels get mixed up, but you can overwrite them with the correct labels.

Assuming that you're using read.spss from the foreign package, the labels are stored as an attribute called "variable.labels". You can access the labels using:

attr(Example_Clean, "variable.labels")

The following code overwrites the labels with the correct ones.

var_labs <- attr(Example_Clean, "variable.labels")
var_labs <- var_labs[names(Example_Clean)]
attr(Example_Clean, "variable.labels") <- var_labs

Upvotes: 2

Related Questions