Reputation: 97
I have a lot of long Excel files and it is too hard to handle them manually. I'm reading them in R to identify the highlighted yellow cells as in the image
The objective: is to loop over the days and the hours in the file in order to construct a data frame that indicates the option according to the hour as
I am following these answers: answer1, answer2, answer3 to do the job using the libraries xlsx
, openxlsx
and tidyr
library(xlsx)
library(openxlsx)
library(tidyr)
wb <- loadWorkbook("active.xlsx") #the table is saved in the file active.xlsx
sheet1 <- getSheets(wb)[[1]]
rows <- getRows(sheet1)
cells <- getCells(rows)
styles <- sapply(cells, getCellStyle)
cellColor <- function(style)
{
fg <- style$getFillForegroundXSSFColor()
rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
rgb <- paste(rgb, collapse = "")
return(rgb)
}
mycolor <- (yellow = "ffff00")
m <- match(sapply(styles, cellColor), mycolor)
But the data is neither read nor processed correctly and the code is not yielding the needed result, I am not even close!
Is it possible to guide me and link a tutorial or a package in R which I can use to detect the highlighted cells and to construct the required dataframe?
Upvotes: 0
Views: 1221
Reputation: 922
I made an excel file similar to one shown named openxlsx highlighted cell.xlsx
Using openxlsx, create a workbook then load existing "openxlsx highlighted cell.xlsx". This will preserve the format of the excelworkbook, but not reading in the data. If using RStudio IDE you can look and View the existing styles in the workbook
library(tidyverse)
library(openxlsx)
wb <- createWorkbook()
wb <- openxlsx::loadWorkbook(file = "path_to_file\\openxlsx highlighted cell.xlsx",isUnzipped = F)
wb$styleObjects #shows all styles
Looking at styles, the second style has the yellow highlighted cell as indicated by rgb value, it also lists the corresponding rows and column coordinates of the cells with this format.
Next, make vectors of the rows and columns, these will be used later
highlighted_rows <- wb$styleObjects[[2]]$rows
highlighted_cols <- wb$styleObjects[[2]]$cols
Next, read in the data as a separate workbook, this will read in the data as it appears in the excel. since one of these is a merged cell and is important for your date, specify that the merge cell should be filled. This will enable you to track the date/option combination. Set colnames to FALSE since the first row will be used as data and not column names
wb_read <- createWorkbook()
wb_read <- readWorkbook(xlsxFile = "path_to_file\\openxlsx highlighted cell.xlsx",fillMergedCells = T,colNames = FALSE)
Excel did not like the time row so I had to rewrite the time values (also converted 12:30:00 to 00:30:00), but this could be due to how I formatted the excel originally
times <- seq(from=as.POSIXct("2005-02-05 00:30:00"),to=as.POSIXct("2005-02-05 09:30:00"),by="30 min") %>%
substr(start = 12, stop = 20)
wb_read[1,3:21] <- times
Only applied this to first row since row 15 is duplicate of row 1 for time
You essentially will need indices of where the times and options are located in their respective series. So we make two vectors, one that has the time position - this will include the initial A and B columns that don't have time values
times_positions <- wb_read[1,]
times_positions <- times_positions %>%
t()
And similarly where the option value is and date value (this is the merged cell that was filled in with earlier code)
option_index <- wb_read %>%
pull(X2)
date_index <- wb_read %>%
pull(X1)
Since you also want to retain NA values or times with there is no highlighted cell, you'll also need an expanded dataset to later join the highlighted cell data with. This code could be improved with a function, but I left as is for demo/there were just 2 dates:
date1 <- seq(from=as.POSIXct("2005-01-05 00:30:00"),to=as.POSIXct("2005-01-05 09:30:00"),by="30 min")
date2 <- seq(from=as.POSIXct("2005-02-05 00:30:00"),to=as.POSIXct("2005-02-05 09:30:00"),by="30 min")
date1 <- as_tibble(date1) %>%
set_names("X")
date2 <- as_tibble(date2) %>%
set_names("X")
all_date_times <- bind_rows(date1,date2)
Then it is just bringing together the highlighted rows/cells indexing the vectors for what that should be, and joining with all combinations to identify where there would be NA highlights.
df <- as_tibble(bind_cols(highlighted_rows,highlighted_cols)) %>%
set_names("rows", "columns") %>%
mutate(Option = option_index[rows],
time = times_positions[columns],
date = mdy(date_index[rows])) %>% #change to date here
dplyr::select(Option, time, date) %>%
mutate(date_time = paste(date,time) %>% strptime(.,format = "%Y-%m-%d %H:%M:%S")) %>%
right_join(all_date_times, by = c("date_time" = "X")) %>%
arrange(date_time) %>%
dplyr::select(date_time, Option) %>%
rename("Date" = "date_time")
df %>%
View()
Upvotes: 1
Reputation: 51
Check out the free online book'Spreadsheet Munging Strategies':
https://nacnudus.github.io/spreadsheet-munging-strategies/ with the tidyxl package.
In the case of colored cells:
https://nacnudus.github.io/spreadsheet-munging-strategies/tidy-formatted-cells.html
Upvotes: 2