readr::col_date() inconsistent for abbreviated month names with non-UTF8 files?

Question

Problem description

I'm unable to correctly parse dates in a tab-separated file (columns BEGINNPROBENAHME and ENDEPROBENAHME) that was saved with ISO-8859-1 or Windows-1252 encoding. Unfortunately, the dates are written in the format %d-%b-%y with a non-standard (German) month abbreviation, e.g. 18-Mär-22 (18 March 2022). To handle this, I'm using a custom locale with custom date_names.

The following code works for some fields, but others are not parsed. Experimenting with the larger original file suggests that some kind of guessing (by lubridate?) is involved: some fields would occasionally fail to parse while others parsed fine, and the outcome changed when, for example, some preceding rows were removed. Interestingly, parsing works without issues if the file is re-saved in UTF-8 (with encoding = "UTF-8" in locale()). What makes no sense to me is that parsing works with the UTF-8 file but not with the ISO-8859-1 file. Shouldn't readr first convert all strings to UTF-8 (its internal representation) and only then attempt to parse the dates?

Any pointers or suggestions are welcome. I'm aware that I could just convert the input file(s) to UTF-8, or replace the abbreviations with full names or month numbers, but I would like to make this work with the given file and readr/lubridate - if possible.
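For reference, the re-encoding fallback mentioned above can be done in base R before readr ever sees the file. The sketch below is self-contained: it writes a small, hypothetical ISO-8859-1 sample to a temporary file (a stand-in for the real input), converts the lines with iconv(), and writes a UTF-8 copy. The names beispiel, latin1_pfad and utf8_pfad are placeholders introduced only for illustration.

```r
# Hypothetical stand-in for the real input: two lines stored as ISO-8859-1 bytes.
beispiel <- c("BEGINNPROBENAHME\tENDEPROBENAHME", "02-M\u00e4r-22\t16-M\u00e4r-22")
latin1_pfad <- tempfile(fileext = ".txt")
writeLines(iconv(beispiel, from = "UTF-8", to = "ISO-8859-1"), latin1_pfad, useBytes = TRUE)

# Read the raw lines (bytes are kept as they are), re-encode them to UTF-8,
# and write the result to a new file without any further translation.
roh <- readLines(latin1_pfad, warn = FALSE)
utf8_zeilen <- iconv(roh, from = "ISO-8859-1", to = "UTF-8")
utf8_pfad <- tempfile(fileext = ".txt")
writeLines(utf8_zeilen, utf8_pfad, useBytes = TRUE)
```

The UTF-8 copy could then be read with encoding = "UTF-8" in locale(). This only sidesteps the problem; it does not explain why the ISO-8859-1 file fails.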

Code:

library(tidyverse)

test_daten_pfad <- "daten/example_iso88591_dby.txt"

spezifikation_mvdaten <- cols(
  BEGINNPROBENAHME = col_date("%d-%b-%y"),
  ENDEPROBENAHME = col_date("%d-%b-%y"),
  PARAMETER = col_character(),
  WERT_NUM = col_double()
)

mon <- c("Januar", "Februar", "März", "April", "Mai", "Juni", "Juli", "August", "September", "Oktober", "November", "Dezember")
monab <- c("Jan", "Feb", "Mär", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")
day <- c("Sonntag", "Montag", "Dienstag", "Mittwoch", "Donnerstag", "Freitag", "Samstag")
dayab <- c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa")
deCH_locale <- readr::locale(date_names = date_names(mon, monab, day, dayab), date_format = "%d-%b-%y", encoding = "ISO-8859-1", tz = "Europe/Zurich")

test_daten <- readr::read_delim(
  test_daten_pfad,
  col_types = spezifikation_mvdaten,
  locale = deCH_locale,
  delim = "\t"
)

probleme <- problems(test_daten)

Content of example file (example_iso88591_dby.txt) - encoding must be set accordingly:

BEGINNPROBENAHME    ENDEPROBENAHME  PARAMETER   WERT_NUM
21-Mai-22   25-Mai-22   Bifenthrin  50
05-Jan-22   19-Jan-22   Fludioxonil 10
19-Jan-22   02-Feb-22   Diclofenac  336.3
02-Feb-22   16-Feb-22   Gabapentin  331.61
16-Feb-22   02-Mär-22   Triclosan   10
02-Mär-22   16-Mär-22   Aclonifen   10
02-Mär-22   16-Mär-22   Amidotrizoesäure    1143.05
16-Mär-22   30-Mär-22   Metsulfuron-methyl  5
16-Mär-22   30-Mär-22   Napropamid  5
30-Mär-22   13-Apr-22   Diclofenac  310.16
13-Apr-22   27-Apr-22   Chlorpyrifos-methyl 50
27-Apr-22   11-Mai-22   2,4-D   20
11-Mai-22   25-Mai-22   Cyproconazol    5
25-Mai-22   08-Jun-22   Venlafaxin  85.23
08-Jun-22   22-Jun-22   2,4-D   20

Content of test_daten:

structure(list(BEGINNPROBENAHME = structure(c(19133, 18997, 19011, 
19025, 19039, NA, NA, NA, NA, NA, 19095, 19109, 19123, 19137, 
19151), class = "Date"), ENDEPROBENAHME = structure(c(19137, 
19011, 19025, 19039, NA, NA, NA, NA, NA, 19095, 19109, 19123, 
19137, 19151, 19165), class = "Date"), PARAMETER = c("Bifenthrin", 
"Fludioxonil", "Diclofenac", "Gabapentin", "Triclosan", "Aclonifen", 
"Amidotrizoesäure", "Metsulfuron-methyl", "Napropamid", "Diclofenac", 
"Chlorpyrifos-methyl", "2,4-D", "Cyproconazol", "Venlafaxin", 
"2,4-D"), WERT_NUM = c(50, 10, 336.3, 331.61, 10, 10, 1143.05, 
5, 5, 310.16, 50, 20, 5, 85.23, 20)), row.names = c(NA, -15L), spec = structure(list(
    cols = list(BEGINNPROBENAHME = structure(list(format = "%d-%b-%y"), class = c("collector_date", 
    "collector")), ENDEPROBENAHME = structure(list(format = "%d-%b-%y"), class = c("collector_date", 
    "collector")), PARAMETER = structure(list(), class = c("collector_character", 
    "collector")), WERT_NUM = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), delim = "	"), class = "col_spec"), problems = , class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

Content of probleme:

# A tibble: 10 × 5
     row   col expected           actual         file
   <int> <int> <chr>              <chr>          <chr>
 1     6     2 date like %d-%b-%y "02-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
 2     7     1 date like %d-%b-%y "02-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
 3     7     2 date like %d-%b-%y "16-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
 4     8     1 date like %d-%b-%y "02-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
 5     8     2 date like %d-%b-%y "16-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
 6     9     1 date like %d-%b-%y "16-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
 7     9     2 date like %d-%b-%y "30-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
 8    10     1 date like %d-%b-%y "16-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
 9    10     2 date like %d-%b-%y "30-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
10    11     1 date like %d-%b-%y "30-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt

Steps taken to solve problem

Tried parsing only a subset of the original file (problem persists)
Tried parsing a UTF-8 version of the file (worked)
Set the column types to col_character() to check whether the umlaut (ä) is read correctly (it is)
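
Because the umlaut survives when the columns are read as col_character(), one possible workaround is to split reading and date parsing into two steps: read the date columns as text using the ISO-8859-1 locale, then parse them with readr::parse_date() and a locale that carries only the custom date names. This is a minimal sketch, not a confirmed fix. It assumes the strings are UTF-8 once they are in R, it uses a small temporary file as a hypothetical stand-in for the real input, and the names zeilen, pfad, roh and datum_locale are introduced only for illustration.

```r
library(readr)

# Month and day names from the question, written with \u00e4 escapes so the
# script does not depend on how its own source file is encoded.
mon <- c("Januar", "Februar", "M\u00e4rz", "April", "Mai", "Juni",
         "Juli", "August", "September", "Oktober", "November", "Dezember")
monab <- c("Jan", "Feb", "M\u00e4r", "Apr", "Mai", "Jun",
           "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")
day <- c("Sonntag", "Montag", "Dienstag", "Mittwoch", "Donnerstag", "Freitag", "Samstag")
dayab <- c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa")

# Hypothetical stand-in for the real file: three lines written as ISO-8859-1.
zeilen <- c(
  "BEGINNPROBENAHME\tENDEPROBENAHME\tPARAMETER\tWERT_NUM",
  "16-Feb-22\t02-M\u00e4r-22\tTriclosan\t10",
  "02-M\u00e4r-22\t16-M\u00e4r-22\tAclonifen\t10"
)
pfad <- tempfile(fileext = ".txt")
writeLines(iconv(zeilen, from = "UTF-8", to = "ISO-8859-1"), pfad, useBytes = TRUE)

# Step 1: read the date columns as text; the file encoding is only used here.
roh <- read_delim(
  pfad,
  delim = "\t",
  col_types = cols(
    BEGINNPROBENAHME = col_character(),
    ENDEPROBENAHME = col_character(),
    PARAMETER = col_character(),
    WERT_NUM = col_double()
  ),
  locale = locale(encoding = "ISO-8859-1")
)

# Step 2: parse the strings with a locale that keeps the default UTF-8
# encoding and only supplies the custom date names.
datum_locale <- locale(date_names = date_names(mon, monab, day, dayab))
roh$BEGINNPROBENAHME <- parse_date(roh$BEGINNPROBENAHME, "%d-%b-%y", locale = datum_locale)
roh$ENDEPROBENAHME <- parse_date(roh$ENDEPROBENAHME, "%d-%b-%y", locale = datum_locale)
```

The second locale deliberately omits encoding = "ISO-8859-1", on the assumption that the text is already UTF-8 after step 1 and should not be converted again. This has not been run against the original file.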

Environment

sessionInfo():

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.0.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2     readr_2.1.5    
 [7] tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.1   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.3         rlang_1.1.4       stringi_1.8.4     generics_0.1.3    glue_1.8.0       
 [7] colorspace_2.1-1  hms_1.1.3         scales_1.3.0      fansi_1.0.6       grid_4.4.1        munsell_0.5.1    
[13] tzdb_0.4.0        lifecycle_1.0.4   compiler_4.4.1    timechange_0.3.0  pkgconfig_2.0.3   rstudioapi_0.16.0
[19] R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3    tools_4.4.1      
[25] withr_3.0.1       gtable_0.3.5
