I'm unable to correctly parse dates in a tab-separated file (columns BEGINNPROBENAHME and ENDEPROBENAHME) that was saved in ISO-8859-1 or Windows-1252 encoding. Unfortunately, the dates are written in the format %d-%b-%y with non-English (German) month abbreviations, e.g. 18-Mär-22 (18 March 2022). To handle this I'm using a custom locale with custom date_names.
The following code works in some cases, but some fields are not parsed. Experimenting with the larger original file suggests that some kind of guessing (by lubridate?) is involved: the same field can fail to parse in one run and parse fine in another (e.g. after removing some preceding rows). Interestingly, parsing works without issues if the file is re-saved as UTF-8 (and read with encoding = "UTF-8" in locale())!
What makes no sense to me is that the parsing works correctly with a Unicode file but not with the ISO-8859-1 file. Shouldn't readr first convert all strings to UTF-8 (for internal representation) and only then attempt parsing of dates?
Any pointers or suggestions are welcome. I'm aware that I could just convert the input file(s) to UTF-8, or replace the abbreviations with full names or month numbers, but I would like to make this work with the given file and readr/lubridate, if possible.
Code:
library(tidyverse)

test_daten_pfad <- "daten/example_iso88591_dby.txt"

spezifikation_mvdaten <- cols(
  BEGINNPROBENAHME = col_date("%d-%b-%y"),
  ENDEPROBENAHME = col_date("%d-%b-%y"),
  PARAMETER = col_character(),
  WERT_NUM = col_double()
)

mon <- c("Januar", "Februar", "März", "April", "Mai", "Juni", "Juli", "August", "September", "Oktober", "November", "Dezember")
monab <- c("Jan", "Feb", "Mär", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")
day <- c("Sonntag", "Montag", "Dienstag", "Mittwoch", "Donnerstag", "Freitag", "Samstag")
dayab <- c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa")

deCH_locale <- readr::locale(
  date_names = date_names(mon, monab, day, dayab),
  date_format = "%d-%b-%y",
  encoding = "ISO-8859-1",
  tz = "Europe/Zurich"
)

test_daten <- readr::read_delim(
  test_daten_pfad,
  col_types = spezifikation_mvdaten,
  locale = deCH_locale,
  delim = "\t"
)

probleme <- problems(test_daten)
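One route that stays within readr and leaves the file untouched would be to read the two date columns as character (readr converts character columns to UTF-8 according to the locale's encoding) and parse them afterwards with parse_date(). The following is only a sketch under that assumption, not a confirmed fix: the temporary file, the two sample rows, and the names tmp, zeilen, roh_daten and utf8_locale are made up for illustration.

```r
library(readr)

# Hypothetical stand-in for the real file: a header plus two rows,
# written to disk as ISO-8859-1 bytes.
zeilen <- c(
  "BEGINNPROBENAHME\tENDEPROBENAHME\tPARAMETER\tWERT_NUM",
  "16-Feb-22\t02-M\u00e4r-22\tTriclosan\t10",
  "02-M\u00e4r-22\t16-M\u00e4r-22\tAclonifen\t10"
)
tmp <- tempfile(fileext = ".txt")
inhalt <- enc2utf8(paste0(paste(zeilen, collapse = "\n"), "\n"))
writeBin(iconv(inhalt, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]], tmp)

# Same custom date names as in the question, written with \u escapes.
mon <- c("Januar", "Februar", "M\u00e4rz", "April", "Mai", "Juni", "Juli",
         "August", "September", "Oktober", "November", "Dezember")
monab <- c("Jan", "Feb", "M\u00e4r", "Apr", "Mai", "Jun", "Jul", "Aug",
           "Sep", "Okt", "Nov", "Dez")
day <- c("Sonntag", "Montag", "Dienstag", "Mittwoch", "Donnerstag",
         "Freitag", "Samstag")
dayab <- c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa")

# Step 1: read the date columns as plain text; only the file encoding
# is declared here.
roh_daten <- read_delim(
  tmp,
  delim = "\t",
  col_types = cols(
    BEGINNPROBENAHME = col_character(),
    ENDEPROBENAHME = col_character(),
    PARAMETER = col_character(),
    WERT_NUM = col_double()
  ),
  locale = locale(encoding = "ISO-8859-1")
)

# Step 2: parse the (now UTF-8) strings with a locale that carries the
# German names and keeps the default UTF-8 encoding.
utf8_locale <- locale(date_names = date_names(mon, monab, day, dayab))
roh_daten$BEGINNPROBENAHME <- parse_date(
  roh_daten$BEGINNPROBENAHME, format = "%d-%b-%y", locale = utf8_locale
)
roh_daten$ENDEPROBENAHME <- parse_date(
  roh_daten$ENDEPROBENAHME, format = "%d-%b-%y", locale = utf8_locale
)
```

This separates the encoding conversion from the date parsing, so the month names are compared only after everything is UTF-8. It does not explain why col_date() behaves differently inside read_delim().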
Content of example file (example_iso88591_dby.txt) - encoding must be set accordingly:
BEGINNPROBENAHME ENDEPROBENAHME PARAMETER WERT_NUM
21-Mai-22 25-Mai-22 Bifenthrin 50
05-Jan-22 19-Jan-22 Fludioxonil 10
19-Jan-22 02-Feb-22 Diclofenac 336.3
02-Feb-22 16-Feb-22 Gabapentin 331.61
16-Feb-22 02-Mär-22 Triclosan 10
02-Mär-22 16-Mär-22 Aclonifen 10
02-Mär-22 16-Mär-22 Amidotrizoesäure 1143.05
16-Mär-22 30-Mär-22 Metsulfuron-methyl 5
16-Mär-22 30-Mär-22 Napropamid 5
30-Mär-22 13-Apr-22 Diclofenac 310.16
13-Apr-22 27-Apr-22 Chlorpyrifos-methyl 50
27-Apr-22 11-Mai-22 2,4-D 20
11-Mai-22 25-Mai-22 Cyproconazol 5
25-Mai-22 08-Jun-22 Venlafaxin 85.23
08-Jun-22 22-Jun-22 2,4-D 20
Content of test_daten:
structure(list(BEGINNPROBENAHME = structure(c(19133, 18997, 19011,
19025, 19039, NA, NA, NA, NA, NA, 19095, 19109, 19123, 19137,
19151), class = "Date"), ENDEPROBENAHME = structure(c(19137,
19011, 19025, 19039, NA, NA, NA, NA, NA, 19095, 19109, 19123,
19137, 19151, 19165), class = "Date"), PARAMETER = c("Bifenthrin",
"Fludioxonil", "Diclofenac", "Gabapentin", "Triclosan", "Aclonifen",
"Amidotrizoesäure", "Metsulfuron-methyl", "Napropamid", "Diclofenac",
"Chlorpyrifos-methyl", "2,4-D", "Cyproconazol", "Venlafaxin",
"2,4-D"), WERT_NUM = c(50, 10, 336.3, 331.61, 10, 10, 1143.05,
5, 5, 310.16, 50, 20, 5, 85.23, 20)), row.names = c(NA, -15L), spec = structure(list(
cols = list(BEGINNPROBENAHME = structure(list(format = "%d-%b-%y"), class = c("collector_date",
"collector")), ENDEPROBENAHME = structure(list(format = "%d-%b-%y"), class = c("collector_date",
"collector")), PARAMETER = structure(list(), class = c("collector_character",
"collector")), WERT_NUM = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = "\t"), class = "col_spec"), problems = <pointer: 0x1402099f0>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
Content of probleme:
# A tibble: 10 × 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 6 2 date like %d-%b-%y "02-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
2 7 1 date like %d-%b-%y "02-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
3 7 2 date like %d-%b-%y "16-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
4 8 1 date like %d-%b-%y "02-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
5 8 2 date like %d-%b-%y "16-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
6 9 1 date like %d-%b-%y "16-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
7 9 2 date like %d-%b-%y "30-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
8 10 1 date like %d-%b-%y "16-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
9 10 2 date like %d-%b-%y "30-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
10 11 1 date like %d-%b-%y "30-M\xe4r-22" /Users/redacted/daten/example_iso88591_dby.txt
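The "\xe4" in the actual column is the single ISO-8859-1 byte for "ä", which suggests the date parser is seeing the raw file bytes rather than UTF-8. A small sketch of how the two byte sequences differ (mar_utf8 and mar_latin1 are names made up for this illustration):

```r
# "Mär" as a UTF-8 string and the same text re-encoded to ISO-8859-1.
mar_utf8 <- enc2utf8("M\u00e4r")
mar_latin1 <- iconv(mar_utf8, from = "UTF-8", to = "latin1")

charToRaw(mar_utf8)    # 4d c3 a4 72
charToRaw(mar_latin1)  # 4d e4 72

# A byte-wise comparison of the two representations cannot match.
identical(charToRaw(mar_utf8), charToRaw(mar_latin1))  # FALSE
```

If the month abbreviations from date_names are held as UTF-8 while the field is still in the file's encoding, a byte-level match on "Mär" would fail in exactly this way. Whether that is what actually happens inside readr is an assumption on my part.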
sessionInfo():
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.0.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[7] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.3 rlang_1.1.4 stringi_1.8.4 generics_0.1.3 glue_1.8.0
[7] colorspace_2.1-1 hms_1.1.3 scales_1.3.0 fansi_1.0.6 grid_4.4.1 munsell_0.5.1
[13] tzdb_0.4.0 lifecycle_1.0.4 compiler_4.4.1 timechange_0.3.0 pkgconfig_2.0.3 rstudioapi_0.16.0
[19] R6_2.5.1 tidyselect_1.2.1 utf8_1.2.4 pillar_1.9.0 magrittr_2.0.3 tools_4.4.1
[25] withr_3.0.1 gtable_0.3.5