Reputation: 215
I have an Excel file with a column of alphanumeric codes of variable length - see sample at end of question. I have been trying to read this into R using the readxl
library:
read_xlsx("File.xlsx")
This works for over 90% of the data, but for a significant minority of data points, I'm getting a floating point error, e.g. 2.2000000000000002
or 8.1999999999999993
.
This only occurs for codes in the format "number point number", i.e. not for stand alone numbers (2
or 8
) or for codes including letters (2.4.a
or 5.2.b.i
).
I have selected the entire column in Excel and formatted as Text
, but without improving the situation. I have also tried using a different R library to read in the data: openxlsx
's read.xlsx
function, but the same error occurs. This suggests the error is in how Excel is saving the file (treating decimal codes as numeric, even though I have set the whole column to text), rather than an error with R. R is successfully reading the column in as character (<chr>
).
In original .xlsx file | Read in by R as |
---|---|
1 | 1 |
1.1 | 1.1000000000000001 |
1.2 | 1.2 |
1.3 | 1.3 |
2 | 2 |
2.1 | 2.1 |
2.2 | 2.2000000000000002 |
2.3 | 2.2999999999999998 |
2.4 | 2.4 |
2.5 | 2.5 |
3 | 3 |
3.1 | 3.1 |
3.1.a | 3.1.a |
3.1.b | 3.1.b |
3.1.c | 3.1.c |
3.1.d | 3.1.d |
3.2 | 3.2 |
3.3 | 3.3 |
3.3.a | 3.3.a |
3.3.b | 3.3.b |
3.4 | 3.4 |
4 | 4 |
4.1 | 4.0999999999999996 |
4.1.a | 4.1.a |
4.1.b | 4.1.b |
Upvotes: 0
Views: 119
Reputation: 76495
Here is a solution.
Write a function that
NA
;library(readxl)
round_char <- function(x) {
y <- as.numeric(x)
i <- is.na(y)
y[i] <- x[i]
y
}
df1 <- read_xlsx("~/Temp/so_79134563.xlsx", col_types = c("text", "text"))
round_char(df1$read_xlsx)
#> Warning in round_char(df1$read_xlsx): NAs introduced by coercion
#> [1] "1" "1.1" "1.2" "1.3" "2" "2.1" "2.2" "2.3" "2.4"
#> [10] "2.5" "3" "3.1" "3.1.a" "3.1.b" "3.1.c" "3.1.d" "3.2" "3.3"
#> [19] "3.3.a" "3.3.b" "3.4" "4" "4.1" "4.1.a" "4.1.b"
df1$rounded <- round_char(df1$read_xlsx)
#> Warning in round_char(df1$read_xlsx): NAs introduced by coercion
df1
#> # A tibble: 25 × 3
#> Original read_xlsx rounded
#> <chr> <chr> <chr>
#> 1 1 1 1
#> 2 1.1 1.1000000000000001 1.1
#> 3 1.2 1.2 1.2
#> 4 1.3 1.3 1.3
#> 5 2 2 2
#> 6 2.1 2.1 2.1
#> 7 2.2 2.2000000000000002 2.2
#> 8 2.3 2.2999999999999998 2.3
#> 9 2.4 2.4 2.4
#> 10 2.5 2.5 2.5
#> # ℹ 15 more rows
Created on 2024-10-28 with reprex v2.1.1
Upvotes: 0
Reputation: 4603
Such style "1.1" is problematic since it can be accepted by a number or a date. It can be resolved by putting the apostrophe sign at the start of the string: "'1.1". This enforces Excel to treat the entered text as a text. Potentially this can solve the problem. Otherwise put any letter at the start of the string: "R1.1". Considering its meaning it should not affect anything, but codes will not be parsed as numbers.
UPD
The better way I guess is to replace the comma by, for e. g., the Middle Dot (0x00B7): "1·1"
Upvotes: 0