Read in Multiple Excel Files with Different Layouts in R

Question

I have a collection of a dozen excel files I am reading in to a list of dataframes in R with the following code:

data_path <- "path"
files <- dir(data_path, pattern = "*.xlsx")

data <- files %>%
  map(~readWorkbook(file.path(data_path, .), sheet = "Results"))

This grabs everything no problems. The issue is that I need them all in the same format for further manipulation and, due to a non-universal layout, some are imported like this:

X1     2016     2017     2018
y       12       12       12

and others like this:

Result
y         2016       2017       2018
x          12         12         12

The reason is because some excel files are forwarded to me with an additional row at the top with the string character 'Results'

Now I could fix this with direct surgery to each one:

names(data) <- rbind(data[1,])
names(data)[1] <- "X1"
data <- data[-c(1),]

But this seems like a rather ugly hack solution that will lead to automation problems down the line. Is there a way to use the readWorkbook() function but specify to skip rows if they contain certain values?

eg perhaps something like:

if value equal to 'Result' {
  skipRow()
}

Or to search dataframes for rows of dates and use these as column names?

Dunois · Accepted Answer

So, the easiest solution I can think of here is something like this.

First, import the xlsx files with colNames = FALSE like so:

data <- files %>%
  map(~readWorkbook(file.path(getwd(), .), sheet = "Sheet1", colNames = FALSE))

Now all you need to do is - remove the first row if it contains "Result" in the first column - assign each xlsx file to its own data frame (optional) - set the column names for each of these files (optional)

This can be done like so:

for(i in 1:length(data)){
  data[[i]] %<>% filter(X1 != "Result") #Alternatively data[[i]] <- data[[i]] %>% filter(X1 != "Result")
  assign(paste0("FileName", i), as.data.frame(data[[i]]))
  names(paste0("FileName", i)) <- c("Names", "For", "Your", "Columns")
}

Please note the usage of the reverse pipe %<>% (from the package magrittr) in the first statement inside the for loop.

Note: this will remove any and all rows that contain the string "Result" in the first column.

Read in Multiple Excel Files with Different Layouts in R

Answers (1)

Related Questions