Extract data from excel using R

I have to extract a certain subsection of data from over 100 excel files. The problem is that the total rows across these files are not constant. Is there a way I can stop R from reading data based on a string variable in one of the excel?.

Image given Here

Upvotes: 2

Views: 11087

Answers (1)

eipi10
eipi10

Reputation: 93791

You can read just the first two columns of each Excel file and then, by exploiting the regularities in the structure of each Excel file, identify and keep just the rows from the first table.

Here's an approach using the read_excel function from the readxl package to read in the data. We read in only the first two columns of the Excel file. read_excel automatically trims white space and starts reading from the first non-empty row.

After reading in the data, we have a data frame with two columns. We now need to make the following changes:

  1. Reset the column names to the correct values.
  2. Remove the first data row, since those were originally the column names
  3. Remove rows after the first table. Since the second table begins with "Time Recording Details", we keep data frame rows only up to the row number that is two less than the row number where this text appears in the first column.

library(readxl)

# Read first two columns
df = read_excel("Workbook1.xlsx", col_types=c("text","numeric"), range=cell_cols("A:B"))

# Reset column names
names(df) = c("Project", "Hours")

# Remove first row (which contains the column names form the excel file)
df = df[-1, ]

# Remove rows after the end of the first table
df = df[1:(which(df[,1]=="Time Recording Details") - 2), ]

Here's the output using the sample Excel file shown below:

df 
  Project Hours
1       A     1
2       B     2
3       A     3
4       B     4
5       A     5
6       B     6
7       A     7

To read multiple files, you can read them all into a list of data frames as follows:

files = list.files(pattern="xlsx")

df.list = lapply(files, function(f) {

  # Read first two columns
  df = read_excel(f, col_types=c("text","numeric"), range=cell_cols("A:B"))

  # Reset column names
  names(df) = c("Project", "Hours")

  # Remove first row (which contains the column names form the excel file)
  df = df[-1, ]

  # Remove rows after the end of the first table
  df[1:(which(df[,1]=="Time Recording Details") - 2), ]

})

Here's a picture of the sample Excel file I created to test the code:

enter image description here

Upvotes: 1

Related Questions