Reputation: 99
I have an excel file (.xlsx) with 15000 records which I loaded to R, and there is a column 'X' which has data after 10000 rows.
Data <- read_excel("Business_Data.xlsx", sheet = 3, skip = 2)
When I checked the dataframe after importing file, I could see only NA in that 'X' column. Rather, column X has factors like "Cost +, Resale-, Purchase" which are not getting captured. Is it because the data for this column contains after 10000 records? Or am I missing something?
Upvotes: 1
Views: 579
Reputation: 1165
read_excel
tries to infer the type of the data using the first 1000 rows by default.
If it can't get the right type and can't coerce the data to this type, you'll get NA.
You probably had a warning : "There were 50 or more warnings (use warnings() to see the first 50)"
And checking the warnings tells you something like :
> warnings()
Messages d'avis :
1: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, ... :
Expecting logical in B15002 / R15002C2: got 'A'
...
Solution : add the argument guess_max = 20000
library(tidyverse)
library(writexl)
library(readxl)
# create a dataframe with a character column "empty" at the beginning
df1 <- tibble(x = 1:20000,
y = c(rep(NA_character_, 15000), rep("A", 5000)))
# bottom rows are OK
tail(df1)
#> # A tibble: 6 x 2
#> x y
#> <int> <chr>
#> 1 19995 A
#> 2 19996 A
#> 3 19997 A
#> 4 19998 A
#> 5 19999 A
#> 6 20000 A
write_xlsx(df1, "d:/temp/test.xlsx")
# we read back ; bottom rows are missing !
df2 <- read_xlsx("d:/temp/test.xlsx")
tail(df2)
#> # A tibble: 6 x 2
#> x y
#> <dbl> <lgl>
#> 1 19995 NA
#> 2 19996 NA
#> 3 19997 NA
#> 4 19998 NA
#> 5 19999 NA
#> 6 20000 NA
# everything is fine with guess_max = 20000
df3 <- read_xlsx("d:/temp/test.xlsx", guess_max = 20000)
tail(df3)
#> # A tibble: 6 x 2
#> x y
#> <dbl> <chr>
#> 1 19995 A
#> 2 19996 A
#> 3 19997 A
#> 4 19998 A
#> 5 19999 A
#> 6 20000 A
So, check warnings !
To be sure you can also coerce type :
df4 <- read_xlsx("d:/temp/test.xlsx",
col_types = c("numeric", "text"))
In any case, note that integers are not recognized from the xlsx format, so you may need to transform your numbers to integers to get the exact original dataframe :
df4 %>%
mutate(x = as.integer(x)) %>%
identical(df1)
#> [1] TRUE
Upvotes: 3