Felix Zhao
Felix Zhao

Reputation: 489

R read_xlsx function to import excel files with filters

I have dozens of excel files which i would like to import into R for futher processing and analysis. These are files for each month. The source files are with filters which were to include data not needed. I wanted to use read_xlsx function from readxl package. However all rows are imported into R including the ones not needed. I know i can import all files and then do the filtering thing or just copy paste original excel files. What i want to know is if i can keep the filters in the files during the read_xlsx step which is the best option for me at the moment.

Thanks in advance,

Felix

Upvotes: 0

Views: 2188

Answers (2)

Jan Marvin
Jan Marvin

Reputation: 969

In addition to @stefan's answer find one using openxlsx2 below. Basically, rows that are filtered are hidden from the worksheet. This information is stored in the row, imported via openxlsx2 and therefore can be used to skip such rows.

library(openxlsx2)

tmp <- temp_xlsx()

# create a file
wb <- wb_workbook()$
  add_worksheet()$add_data(x = mtcars, withFilter = TRUE)$
  add_worksheet()$add_data_table(x = mtcars)$
  save(tmp)

# apply filter and save it (I choose mpg < 20 in `Sheet 1` and mpg > 20 in `Sheet 2`)
xl_open(tmp)

Now to answer the question:

# first sheet:  mpg < 20
sheet1 <- wb_to_df(tmp, sheet = 1, skipHiddenRows = TRUE)
head(sheet1)
#>     mpg cyl  disp  hp drat   wt  qsec vs am gear carb
#> 6  18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2
#> 7  18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1
#> 8  14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
#> 11 19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4
#> 12 17.8   6 167.6 123 3.92 3.44 18.90  1  0    4    4
#> 13 16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3

# second sheet: mpg > 20
sheet2 <- wb_to_df(tmp, sheet = 2, skipHiddenRows = TRUE)
head(sheet2)
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 2  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 3  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 4  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 5  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 9  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> 10 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2

PS: We are currently in the process of streamlining our API and will support snake case arguments (though the code above will still run).

Upvotes: 2

stefan
stefan

Reputation: 125797

I struggled with same issue some time ago. And after some research and AFAIK the short answer is (still) that this can't be achieved via readxl. Moreover, AFAIK there is also (still) no out-of-the-box option in packages like openxlsx or ....

However, as openxlsx could read an excel workbook object it allows to retrieve the information about the set filters which could then be used to apply the filters on the imported dataset.

The following code is an example of how this could be achieved.

Note: The code worked fine for my use-case but it is not meant as a general out-of-the-box solution so you probably have to adjust the code to fit your needs.

First, let's create an example excel file for which I use the mtcars dataset:

library(openxlsx)

wb <- createWorkbook()
addWorksheet(wb, "filter")
writeDataTable(wb, "filter", mtcars)
saveWorkbook(wb, "test-filter.xlsx")

Second, in MS Excel I have set two filters on the table, i.e. I set a filter on the cyl column and a second on the carb column:

enter image description here

Prepared with the example data the following code uses a custom function ...

scrape_filter <- function(table) {
  x <- xml2::read_xml(table)|>
    xml2::xml_child(1)|>
    xml2::xml_children()

  col_id <- x|>
    xml2::xml_attrs()|>
    purrr::map_chr(1)

  vals <- purrr::map_chr(x, function(x) {
    xml2::xml_child(x, 1)|>
      xml2::xml_children()|>
      xml2::xml_attrs()|>
      purrr::map_chr(1)|>
      paste(collapse = "; ")
  })

  data.frame(col_id, vals) |> 
    select(col_id, vals) |>
    mutate(col_id = names(dat)[as.integer(.data$col_id) + 1]) |>
    mutate(filter_cond = if_else(
      .data$vals %in% "notEqual", glue::glue("!is.na({col_id})"),
      glue::glue('{col_id} %in% str_split("{vals}", "; ")[[1]]')
    ))
}

which after loading the data and the workbook could be used to extract the information on the set filters and parses them ...

library(tidyverse)
library(xml2)
library(readxl)
library(openxlsx)
library(glue)

fn <- "test-filter.xlsx"

dat <- readxl::read_xlsx(path = fn)

wb <- loadWorkbook(fn)

filters <- scrape_filter(wb$tables)

R> filters
  col_id vals                           filter_cond
1    cyl 4; 8 cyl %in% str_split("4; 8", "; ")[[1]]
2   carb    2   carb %in% str_split("2", "; ")[[1]]

... and finally applies the filters on the imported dataset for which I use rlang::parse_exprs

filter(dat, !!!rlang::parse_exprs(filters$filter_cond))

# A tibble: 10 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2
 2  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
 3  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
 4  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
 5  15.5     8 318     150  2.76  3.52  16.9     0     0     3     2
 6  15.2     8 304     150  3.15  3.44  17.3     0     0     3     2
 7  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2
 8  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
 9  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
10  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2

Two more notes:

  • In my case the data was stored in tables and the custom function only works on tables.

  • In the example I have only one sheet and one table. In my use-case the excel file contained multiple sheets with one table per sheet. In that case you could or have to use lapply or purrr::map to iterate over the tables and the sheets.

Upvotes: 4

Related Questions