whv20
whv20

Reputation: 215

Avoid floating point error when importing mixed alphanumeric / numeric data from Excel into R

I have an Excel file with a column of alphanumeric codes of variable length - see sample at end of question. I have been trying to read this into R using the readxl library:

read_xlsx("File.xlsx")

This works for over 90% of the data, but for a significant minority of data points, I'm getting a floating point error, e.g. 2.2000000000000002 or 8.1999999999999993.

This only occurs for codes in the format "number point number", i.e. not for stand alone numbers (2 or 8) or for codes including letters (2.4.a or 5.2.b.i).

I have selected the entire column in Excel and formatted as Text, but without improving the situation. I have also tried using a different R library to read in the data: openxlsx's read.xlsx function, but the same error occurs. This suggests the error is in how Excel is saving the file (treating decimal codes as numeric, even though I have set the whole column to text), rather than an error with R. R is successfully reading the column in as character (<chr>).

Sample data

In original .xlsx file Read in by R as
1 1
1.1 1.1000000000000001
1.2 1.2
1.3 1.3
2 2
2.1 2.1
2.2 2.2000000000000002
2.3 2.2999999999999998
2.4 2.4
2.5 2.5
3 3
3.1 3.1
3.1.a 3.1.a
3.1.b 3.1.b
3.1.c 3.1.c
3.1.d 3.1.d
3.2 3.2
3.3 3.3
3.3.a 3.3.a
3.3.b 3.3.b
3.4 3.4
4 4
4.1 4.0999999999999996
4.1.a 4.1.a
4.1.b 4.1.b

Upvotes: 0

Views: 119

Answers (2)

Rui Barradas
Rui Barradas

Reputation: 76495

Here is a solution.
Write a function that

  • tries to coerce the values to numeric;
  • If some cannot be coerced, they become NA;
  • assign the new values based on the corresponding elements of the input vector.
library(readxl)

round_char <- function(x) {
  y <- as.numeric(x)
  i <- is.na(y)
  y[i] <- x[i]
  y
}

df1 <- read_xlsx("~/Temp/so_79134563.xlsx", col_types = c("text", "text"))

round_char(df1$read_xlsx)
#> Warning in round_char(df1$read_xlsx): NAs introduced by coercion
#>  [1] "1"     "1.1"   "1.2"   "1.3"   "2"     "2.1"   "2.2"   "2.3"   "2.4"  
#> [10] "2.5"   "3"     "3.1"   "3.1.a" "3.1.b" "3.1.c" "3.1.d" "3.2"   "3.3"  
#> [19] "3.3.a" "3.3.b" "3.4"   "4"     "4.1"   "4.1.a" "4.1.b"

df1$rounded <- round_char(df1$read_xlsx)
#> Warning in round_char(df1$read_xlsx): NAs introduced by coercion
df1
#> # A tibble: 25 × 3
#>    Original read_xlsx          rounded
#>    <chr>    <chr>              <chr>  
#>  1 1        1                  1      
#>  2 1.1      1.1000000000000001 1.1    
#>  3 1.2      1.2                1.2    
#>  4 1.3      1.3                1.3    
#>  5 2        2                  2      
#>  6 2.1      2.1                2.1    
#>  7 2.2      2.2000000000000002 2.2    
#>  8 2.3      2.2999999999999998 2.3    
#>  9 2.4      2.4                2.4    
#> 10 2.5      2.5                2.5    
#> # ℹ 15 more rows

Created on 2024-10-28 with reprex v2.1.1

Upvotes: 0

rotabor
rotabor

Reputation: 4603

Such style "1.1" is problematic since it can be accepted by a number or a date. It can be resolved by putting the apostrophe sign at the start of the string: "'1.1". This enforces Excel to treat the entered text as a text. Potentially this can solve the problem. Otherwise put any letter at the start of the string: "R1.1". Considering its meaning it should not affect anything, but codes will not be parsed as numbers.

UPD

The better way I guess is to replace the comma by, for e. g., the Middle Dot (0x00B7): "1·1"

Upvotes: 0

Related Questions