Reputation: 474
I have a collection of a dozen excel files I am reading in to a list of dataframes in R with the following code:
data_path <- "path"
files <- dir(data_path, pattern = "*.xlsx")
data <- files %>%
map(~readWorkbook(file.path(data_path, .), sheet = "Results"))
This grabs everything no problems. The issue is that I need them all in the same format for further manipulation and, due to a non-universal layout, some are imported like this:
X1 2016 2017 2018
y 12 12 12
and others like this:
Result
y 2016 2017 2018
x 12 12 12
The reason is because some excel files are forwarded to me with an additional row at the top with the string character 'Results'
Now I could fix this with direct surgery to each one:
names(data) <- rbind(data[1,])
names(data)[1] <- "X1"
data <- data[-c(1),]
But this seems like a rather ugly hack solution that will lead to automation problems down the line. Is there a way to use the readWorkbook() function but specify to skip rows if they contain certain values?
eg perhaps something like:
if value equal to 'Result' {
skipRow()
}
Or to search dataframes for rows of dates and use these as column names?
Upvotes: 2
Views: 626
Reputation: 1843
So, the easiest solution I can think of here is something like this.
First, import the xlsx
files with colNames = FALSE
like so:
data <- files %>%
map(~readWorkbook(file.path(getwd(), .), sheet = "Sheet1", colNames = FALSE))
Now all you need to do is
- remove the first row if it contains "Result" in the first column
- assign each xlsx
file to its own data frame (optional)
- set the column names for each of these files (optional)
This can be done like so:
for(i in 1:length(data)){
data[[i]] %<>% filter(X1 != "Result") #Alternatively data[[i]] <- data[[i]] %>% filter(X1 != "Result")
assign(paste0("FileName", i), as.data.frame(data[[i]]))
names(paste0("FileName", i)) <- c("Names", "For", "Your", "Columns")
}
Please note the usage of the reverse pipe %<>%
(from the package magrittr
) in the first statement inside the for loop.
Note: this will remove any and all rows that contain the string "Result" in the first column.
Upvotes: 1