Rense
Rense

Reputation: 91

How to handle excel files with double headers

I have data in .xlsx files that have a header structure spread over two rows:

rowid   CATA      CATB    CATC
        A1 A2 A3  B1 B2 B3  C1 C2 
1       1  1  2   2  3  5   5  6 
...

Furthermore, the number of columns (CATA CATB etc) in the first header can change across files and also the number of columns in the second header colA1 ... colC2).

In excel the first header is indicated with merged cells, delimiting the range of columns in the second header.

I have about hundred files so I would like to have an algorithm (no manual work) for getting the data structured as:

Rowid Cat  Col val 
1     CATA A1  1 
1     CATA A2  1 
1     CATA A3  2
1     CATB B1  2 
1     CATB B2  3 
1     CATB B3  5 
1     CATC C1  5 
1     CATC C2  6

What is the best way to do this in R?

Upvotes: 3

Views: 507

Answers (1)

Florian
Florian

Reputation: 25435

Here is a possible solution. We read the xlsx file without headers, fill the missing values in the first row with na.locf from zoo and create a new header that exists of the top two rows combined, i.e. CATA---A1, CATA---A2, CATB---B1, etc. We then use melt to reshape this dataframe into long format, and use separate to split our custom header back in Cat and Col.

I hope this helps!


test.xlsx

enter image description here


library(xlsx)
library(zoo)
library(reshape2)
library(tidyr)

read_my_xlsx <- function(xlsx_name,sheet_id)
{
  my_df <- xlsx::read.xlsx(xlsx_name,sheetIndex=sheet_id,header=F,colClasses='character',stringsAsFactors=FALSE)
  my_df[1,] = na.locf(as.character(unlist(my_df[1,])))
  my_df[1,] = c(my_df[1,1] ,sapply(2:ncol(my_df),function(x) paste0(my_df[1,x],'---',my_df[2,x])))
  colnames(my_df) = my_df[1,]
  my_df = my_df[-c(1,2),]
  my_df = melt(my_df, id.vars=c("rowid"))
  my_df = separate(my_df,variable, c("Cat", "Col"), "---")
  return(my_df)
}

read_my_xlsx('test.xlsx',1)

Output:

   rowid  Cat Col value
1      1 CATA  A1     1
2      2 CATA  A1     1
3      3 CATA  A1     3
4      1 CATA  A2     1
5      2 CATA  A2     4
6      3 CATA  A2     3
7      1 CATB  B1     4
8      2 CATB  B1     2
9      3 CATB  B1     1
10     1 CATB  B2     1
11     2 CATB  B2     1
12     3 CATB  B2     4
13     1 CATB  B3     1
14     2 CATB  B3     2
15     3 CATB  B3     3
16     1 CATC  C1     4
17     2 CATC  C1     2
18     3 CATC  C1     1
19     1 CATC  C2     1
20     2 CATC  C2     2
21     3 CATC  C2     3

Upvotes: 4

Related Questions