Reputation: 29
I have to extract a certain subsection of data from over 100 excel files. The problem is that the total rows across these files are not constant. Is there a way I can stop R from reading data based on a string variable in one of the excel?.
Upvotes: 2
Views: 11087
Reputation: 93791
You can read just the first two columns of each Excel file and then, by exploiting the regularities in the structure of each Excel file, identify and keep just the rows from the first table.
Here's an approach using the read_excel
function from the readxl
package to read in the data. We read in only the first two columns of the Excel file. read_excel
automatically trims white space and starts reading from the first non-empty row.
After reading in the data, we have a data frame with two columns. We now need to make the following changes:
library(readxl)
# Read first two columns
df = read_excel("Workbook1.xlsx", col_types=c("text","numeric"), range=cell_cols("A:B"))
# Reset column names
names(df) = c("Project", "Hours")
# Remove first row (which contains the column names form the excel file)
df = df[-1, ]
# Remove rows after the end of the first table
df = df[1:(which(df[,1]=="Time Recording Details") - 2), ]
Here's the output using the sample Excel file shown below:
df
Project Hours 1 A 1 2 B 2 3 A 3 4 B 4 5 A 5 6 B 6 7 A 7
To read multiple files, you can read them all into a list of data frames as follows:
files = list.files(pattern="xlsx")
df.list = lapply(files, function(f) {
# Read first two columns
df = read_excel(f, col_types=c("text","numeric"), range=cell_cols("A:B"))
# Reset column names
names(df) = c("Project", "Hours")
# Remove first row (which contains the column names form the excel file)
df = df[-1, ]
# Remove rows after the end of the first table
df[1:(which(df[,1]=="Time Recording Details") - 2), ]
})
Here's a picture of the sample Excel file I created to test the code:
Upvotes: 1