Vin
Vin

Reputation: 99

Unable to import whole data from an excel using R

I have an excel file (.xlsx) with 15000 records which I loaded to R, and there is a column 'X' which has data after 10000 rows.

Data <- read_excel("Business_Data.xlsx", sheet = 3, skip = 2)

When I checked the dataframe after importing file, I could see only NA in that 'X' column. Rather, column X has factors like "Cost +, Resale-, Purchase" which are not getting captured. Is it because the data for this column contains after 10000 records? Or am I missing something?

Upvotes: 1

Views: 579

Answers (1)

mdag02
mdag02

Reputation: 1165

read_excel tries to infer the type of the data using the first 1000 rows by default. If it can't get the right type and can't coerce the data to this type, you'll get NA.

You probably had a warning : "There were 50 or more warnings (use warnings() to see the first 50)"

And checking the warnings tells you something like :

> warnings()
Messages d'avis :
1: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet,  ... :
  Expecting logical in B15002 / R15002C2: got 'A'
...

Solution : add the argument guess_max = 20000

library(tidyverse)
library(writexl)
library(readxl)

# create a dataframe with a character column "empty" at the beginning
df1 <- tibble(x = 1:20000,
              y = c(rep(NA_character_, 15000), rep("A", 5000)))

# bottom rows are OK
tail(df1)

#> # A tibble: 6 x 2
#>       x y    
#>   <int> <chr>
#> 1 19995 A    
#> 2 19996 A    
#> 3 19997 A    
#> 4 19998 A    
#> 5 19999 A    
#> 6 20000 A

write_xlsx(df1, "d:/temp/test.xlsx")

# we read back ; bottom rows are missing !
df2 <- read_xlsx("d:/temp/test.xlsx")
tail(df2)

#> # A tibble: 6 x 2
#>       x y    
#>   <dbl> <lgl>
#> 1 19995 NA   
#> 2 19996 NA   
#> 3 19997 NA   
#> 4 19998 NA   
#> 5 19999 NA   
#> 6 20000 NA

# everything is fine with guess_max = 20000
df3 <- read_xlsx("d:/temp/test.xlsx", guess_max = 20000)
tail(df3)

#> # A tibble: 6 x 2
#>       x y    
#>   <dbl> <chr>
#> 1 19995 A    
#> 2 19996 A    
#> 3 19997 A    
#> 4 19998 A    
#> 5 19999 A    
#> 6 20000 A

So, check warnings !

To be sure you can also coerce type :

df4 <- read_xlsx("d:/temp/test.xlsx",
                 col_types = c("numeric", "text"))

In any case, note that integers are not recognized from the xlsx format, so you may need to transform your numbers to integers to get the exact original dataframe :

df4 %>% 
  mutate(x = as.integer(x)) %>% 
  identical(df1)
#> [1] TRUE

Upvotes: 3

Related Questions