bixiou

Reputation: 149

R package/functions allowing two kinds of missing values (kept in regressions or not) and handling labelled variables both as numeric and character

I want to handle survey data, and I would like functions ideal_label and is.missing with the following behavior:

test <- ideal_label(c(1, NA, -1), 
                    labels = structure(c(0, 1, -1), names = c("No", "Yes", "PNR")), 
                    missing.values = c(NA, -1))
  
as.character(test[1])   # "Yes"
as.numeric(test[1])   # 1
test %in% 1   # TRUE FALSE FALSE
test == 1   # TRUE NA FALSE
test %in% "Yes"   # TRUE FALSE FALSE
test == "Yes"   # TRUE NA FALSE
is.na(test)   # FALSE TRUE FALSE
is.missing(test)   # FALSE TRUE TRUE 
lm(c(T, T, T) ~ test)$rank   # 2 (i.e., keeps missing values that are not NA)
df <- data.frame(test = test, true = c(T, T, T))
lm(true ~ test, data = df)$rank   # 2 

This used to be possible with function as.item (and is.missing) of package memisc, with memisc version 0.99.22 and R Version 4.2.1.

However, more recent versions of memisc treat missing values the same as NA (i.e., is.na(test[3]) returns TRUE). And using memisc version 0.99.22 with more recent versions of R tends to treat labelled variables as numeric rather than character (namely, test[1] == "Yes" returns NA and test[1] %in% "Yes" returns FALSE).

I have tested other packages (haven, labelled, forcats) but none of them seem to allow the behavior I need.

How do I achieve this with the latest versions of these libraries?

Upvotes: 0

Views: 55

Answers (1)

Joseph Larmarange

Reputation: 114

First of all, note that the labelled package follows the haven classes for labelled data. haven implements two types of additional missing values: SAS/Stata-like tagged NAs and SPSS-like user NAs. They are presented in detail in https://larmarange.github.io/labelled/articles/missing_values.html

What you are referring to is closer to the user NAs available through the haven_labelled_spss class. labelled already provides functions to distinguish regular NAs from user NAs.

library(labelled)

v <- labelled_spss(
  c(1, NA, -1),
  labels = c(No = 0, Yes = 1, PNR = -1),
  na_values = -1
)
  
v
#> <labelled_spss<double>[3]>
#> [1]  1 NA -1
#> Missing values: -1
#> 
#> Labels:
#>  value label
#>      0    No
#>      1   Yes
#>     -1   PNR
is.na(v)
#> [1] FALSE  TRUE  TRUE
is_regular_na(v)
#> [1] FALSE  TRUE FALSE
is_user_na(v)
#> [1] FALSE FALSE  TRUE

Created on 2025-02-28 with reprex v2.1.1
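
To map this onto the names used in the question, here is a minimal sketch. The is.missing wrapper is a hypothetical helper (it is not part of labelled or haven); it only relies on the behavior shown above, where is.na() is TRUE for both regular and user NAs and is_regular_na() is TRUE for regular NAs only.

```r
library(labelled)

v <- labelled_spss(
  c(1, NA, -1),
  labels = c(No = 0, Yes = 1, PNR = -1),
  na_values = -1
)

# Hypothetical helper named after the question's is.missing():
# TRUE for regular NAs and for user NAs
is.missing <- function(x) is.na(x)

is.missing(v)     # regular NA and user NA are both flagged
is_regular_na(v)  # plays the role of is.na() in the question
```

With this class, is.na() plays the role of the question's is.missing(), and is_regular_na() plays the role of the question's is.na().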

labelled also provides many functions to convert to other formats.

library(labelled)

v <- labelled_spss(
  c(1, NA, -1),
  labels = c(No = 0, Yes = 1, PNR = -1),
  na_values = -1
)
  
to_factor(v)
#> [1] Yes  <NA> PNR 
#> Levels: No Yes PNR
to_factor(v, user_na_to_na = TRUE)
#> [1] Yes  <NA> <NA>
#> Levels: No Yes
to_character(v)
#> [1] "Yes" NA    "PNR"
to_character(v, user_na_to_na = TRUE)
#> [1] "Yes" NA    NA
unclass(v)
#> [1]  1 NA -1
#> attr(,"labels")
#>  No Yes PNR 
#>   0   1  -1 
#> attr(,"na_values")
#> [1] -1
user_na_to_na(v)
#> <labelled<double>[3]>
#> [1]  1 NA NA
#> 
#> Labels:
#>  value label
#>      0    No
#>      1   Yes
user_na_to_na(v) |> unclass()
#> [1]  1 NA NA
#> attr(,"labels")
#>  No Yes 
#>   0   1

Created on 2025-02-28 with reprex v2.1.1
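
For the regression part of the question, one possible approach (a sketch, not the only option) is to convert the vector with to_factor() before modelling. By default the user NA is kept as an ordinary factor level ("PNR"), so lm() only drops the regular NA. The response y and the data frame df below are made up for illustration.

```r
library(labelled)

v <- labelled_spss(
  c(1, NA, -1),
  labels = c(No = 0, Yes = 1, PNR = -1),
  na_values = -1
)

# User NAs become a regular factor level; only the regular NA stays NA
f <- to_factor(v)

# Illustrative response (assumption: any numeric outcome would do)
y <- c(1, 1, 1)
fit <- lm(y ~ f)
fit$rank  # intercept plus one dummy for the "PNR" level

# Same idea with a data frame
df <- data.frame(test = f, y = y)
lm(y ~ test, data = df)$rank
```

The unused level "No" is dropped when lm() builds its model frame, so the fit uses the two remaining rows ("Yes" and "PNR").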

Regarding the comparison operators you mentioned in your Stack Overflow question, keep in mind that haven allows both numeric labelled vectors (e.g. 1 coded as "Yes") and character labelled vectors (e.g. "y" coded as "yes"). So the type of the right-hand side (numeric or character) is not always enough to tell whether you mean the underlying value or its label. However, you can easily adapt your comparisons with a conversion and/or create your own custom operators.

library(labelled)

v <- labelled_spss(
  c(1, NA, -1),
  labels = c(No = 0, Yes = 1, PNR = -1),
  na_values = -1
)
  
v == 1
#> [1]  TRUE    NA FALSE
v %in% 1
#> [1]  TRUE FALSE FALSE

to_character(v) == "Yes"
#> [1]  TRUE    NA FALSE
to_character(v) %in% "Yes"
#> [1]  TRUE FALSE FALSE

`%l=%` <- function(x, y) {
  to_character(x) == y
}

`%lin%` <- function(x, y) {
  to_character(x) %in% y
}

v %l=% "Yes"
#> [1]  TRUE    NA FALSE
v %lin% "Yes"
#> [1]  TRUE FALSE FALSE

Created on 2025-02-28 with reprex v2.1.1
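
If you only work with numeric labelled vectors, the two comparisons can be combined in a single operator that looks at the type of the right-hand side. This is a sketch under that assumption: %is% is a hypothetical name, and it would be ambiguous for character labelled vectors, for the reason given above.

```r
library(labelled)

v <- labelled_spss(
  c(1, NA, -1),
  labels = c(No = 0, Yes = 1, PNR = -1),
  na_values = -1
)

# Hypothetical operator: match on labels when y is character,
# on the underlying values otherwise (numeric labelled vectors only)
`%is%` <- function(x, y) {
  if (is.character(y)) {
    to_character(x) %in% y
  } else {
    unclass(x) %in% y
  }
}

v %is% "Yes"  # compares labels
v %is% 1      # compares underlying values
v %is% "PNR"  # user NAs can still be matched by label
```

Like %in%, this operator never returns NA; replace %in% with == inside the function if you prefer NA for missing entries.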

Upvotes: -1
