Reputation: 489
I have dozens of Excel files which I would like to import into R for further processing and analysis. There is one file per month. The source files have filters applied that hide the data that is not needed. I wanted to use the read_xlsx function from the readxl package. However, all rows are imported into R, including the ones that are filtered out. I know I can import all files and then do the filtering in R, or copy-paste the remaining rows from the original Excel files. What I want to know is whether I can keep the filters applied during the read_xlsx step, which would be the best option for me at the moment.
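For context, the fallback I mentioned (import everything, then filter in R) would look roughly like this; the folder name and the filter condition are just placeholders:
library(readxl)
library(dplyr)
library(purrr)

# read all monthly files from one folder and stack them
files <- list.files("monthly_files", pattern = "\\.xlsx$", full.names = TRUE)
dat <- map_dfr(files, read_xlsx)

# ... and then re-create the filter conditions by hand, e.g.
dat <- filter(dat, !is.na(some_column))  # placeholder condition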
Thanks in advance,
Felix
Upvotes: 0
Views: 2188
Reputation: 969
In addition to @stefan's answer, find one using openxlsx2 below. Basically, rows that are filtered out are hidden in the worksheet. This information is stored in the row, is imported via openxlsx2, and can therefore be used to skip such rows.
library(openxlsx2)
tmp <- temp_xlsx()
# create a file
wb <- wb_workbook()$
  add_worksheet()$add_data(x = mtcars, withFilter = TRUE)$
  add_worksheet()$add_data_table(x = mtcars)$
  save(tmp)
# apply the filters in Excel and save the file (I chose mpg < 20 in `Sheet 1` and mpg > 20 in `Sheet 2`)
xl_open(tmp)
Now to answer the question:
# first sheet: mpg < 20
sheet1 <- wb_to_df(tmp, sheet = 1, skipHiddenRows = TRUE)
head(sheet1)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 6 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
#> 7 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
#> 8 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
#> 11 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4
#> 12 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4
#> 13 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3
# second sheet: mpg > 20
sheet2 <- wb_to_df(tmp, sheet = 2, skipHiddenRows = TRUE)
head(sheet2)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 2 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 3 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 4 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 5 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 9 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 10 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
PS: We are currently in the process of streamlining our API and will support snake case arguments (though the code above will still run).
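A sketch of what the example will presumably look like with snake case arguments (assumed names with_filter and skip_hidden_rows; check the documentation of your installed openxlsx2 version):
library(openxlsx2)
tmp <- temp_xlsx()

# create the file, assuming the snake case argument name
wb_workbook()$
  add_worksheet()$add_data(x = mtcars, with_filter = TRUE)$
  add_worksheet()$add_data_table(x = mtcars)$
  save(tmp)

# read it back, skipping the rows hidden by the filters
sheet1 <- wb_to_df(tmp, sheet = 1, skip_hidden_rows = TRUE)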
Upvotes: 2
Reputation: 125797
I struggled with the same issue some time ago. After some research, AFAIK the short answer is (still) that this can't be achieved via readxl. Moreover, AFAIK there is also (still) no out-of-the-box option in packages like openxlsx or ....
However, as openxlsx can read an Excel workbook as an object, it allows you to retrieve the information about the set filters, which can then be used to apply the filters to the imported dataset.
The following code is an example of how this could be achieved.
Note: The code worked fine for my use case, but it is not meant as a general out-of-the-box solution, so you will probably have to adjust it to fit your needs.
First, let's create an example Excel file, for which I use the mtcars dataset:
library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "filter")
writeDataTable(wb, "filter", mtcars)
saveWorkbook(wb, "test-filter.xlsx")
Second, in MS Excel I set two filters on the table, i.e. a filter on the cyl column and a second on the carb column.
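To see what the custom function below has to parse, you can inspect the raw table XML stored in the workbook object. For the two filters above it should contain something along these lines (abbreviated and reindented; the exact attributes may differ):
library(openxlsx)
wb <- loadWorkbook("test-filter.xlsx")
wb$tables
# roughly (colId is zero-based, so cyl, the 2nd column, is colId = "1"):
# <table ...>
#   <autoFilter ref="A1:K33">
#     <filterColumn colId="1">
#       <filters><filter val="4"/><filter val="8"/></filters>
#     </filterColumn>
#     <filterColumn colId="10">
#       <filters><filter val="2"/></filters>
#     </filterColumn>
#   </autoFilter>
#   ...
# </table>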
Prepared with the example data, the following code uses a custom function ...
scrape_filter <- function(table) {
  # the first child of the table XML is the autoFilter element;
  # its children are the filterColumn nodes
  x <- xml2::read_xml(table) |>
    xml2::xml_child(1) |>
    xml2::xml_children()

  # zero-based indices of the filtered columns
  col_id <- x |>
    xml2::xml_attrs() |>
    purrr::map_chr(1)

  # the filter values per column, collapsed into one string
  vals <- purrr::map_chr(x, function(x) {
    xml2::xml_child(x, 1) |>
      xml2::xml_children() |>
      xml2::xml_attrs() |>
      purrr::map_chr(1) |>
      paste(collapse = "; ")
  })

  # map the indices to column names (note: relies on the global `dat`)
  # and build the filter conditions as strings
  data.frame(col_id, vals) |>
    select(col_id, vals) |>
    mutate(col_id = names(dat)[as.integer(.data$col_id) + 1]) |>
    mutate(filter_cond = if_else(
      .data$vals %in% "notEqual", glue::glue("!is.na({col_id})"),
      glue::glue('{col_id} %in% str_split("{vals}", "; ")[[1]]')
    ))
}
which, after loading the data and the workbook, can be used to extract the information on the set filters and parse it ...
library(tidyverse)
library(xml2)
library(readxl)
library(openxlsx)
library(glue)

fn <- "test-filter.xlsx"
dat <- readxl::read_xlsx(path = fn)  # import the complete dataset
wb <- loadWorkbook(fn)               # load the workbook object
filters <- scrape_filter(wb$tables)  # extract and parse the set filters
R> filters
col_id vals filter_cond
1 cyl 4; 8 cyl %in% str_split("4; 8", "; ")[[1]]
2 carb 2 carb %in% str_split("2", "; ")[[1]]
... and finally applies the filters to the imported dataset, for which I use rlang::parse_exprs:
filter(dat, !!!rlang::parse_exprs(filters$filter_cond))
# A tibble: 10 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
4 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
5 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
6 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
7 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
8 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
9 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
10 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
Two more notes:
In my case the data was stored in tables and the custom function only works on tables.
In the example there is just one sheet with one table. In my use case the Excel file contained multiple sheets with one table per sheet; in that case you could (or have to) use lapply or purrr::map to iterate over the tables and the sheets, as sketched below.
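A rough sketch of that iteration, assuming one table per sheet and that wb$tables is in the same order as the sheets; note that scrape_filter() looks up column names via the global dat, so this needs to run at the top level:
fn <- "test-filter.xlsx"
wb <- loadWorkbook(fn)
sheets <- readxl::excel_sheets(fn)

res <- list()
for (i in seq_along(sheets)) {
  # `dat` must end up in the global environment for scrape_filter()
  dat <- readxl::read_xlsx(path = fn, sheet = sheets[i])
  filters <- scrape_filter(wb$tables[i])
  res[[sheets[i]]] <- filter(dat, !!!rlang::parse_exprs(filters$filter_cond))
}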
Upvotes: 4