Reputation: 91
I have data in .xlsx files that have a header structure spread over two rows:
rowid CATA CATB CATC
A1 A2 A3 B1 B2 B3 C1 C2
1 1 1 2 2 3 5 5 6
...
Furthermore, the number of columns (CATA CATB etc) in the first header can change across files and also the number of columns in the second header colA1 ... colC2).
In excel the first header is indicated with merged cells, delimiting the range of columns in the second header.
I have about hundred files so I would like to have an algorithm (no manual work) for getting the data structured as:
Rowid Cat Col val
1 CATA A1 1
1 CATA A2 1
1 CATA A3 2
1 CATB B1 2
1 CATB B2 3
1 CATB B3 5
1 CATC C1 5
1 CATC C2 6
What is the best way to do this in R?
Upvotes: 3
Views: 507
Reputation: 25435
Here is a possible solution. We read the xlsx
file without headers, fill the missing values in the first row with na.locf
from zoo
and create a new header that exists of the top two rows combined, i.e. CATA---A1
, CATA---A2
, CATB---B1
, etc. We then use melt
to reshape this dataframe into long format, and use separate to split our custom header back in Cat
and Col
.
I hope this helps!
test.xlsx
library(xlsx)
library(zoo)
library(reshape2)
library(tidyr)
read_my_xlsx <- function(xlsx_name,sheet_id)
{
my_df <- xlsx::read.xlsx(xlsx_name,sheetIndex=sheet_id,header=F,colClasses='character',stringsAsFactors=FALSE)
my_df[1,] = na.locf(as.character(unlist(my_df[1,])))
my_df[1,] = c(my_df[1,1] ,sapply(2:ncol(my_df),function(x) paste0(my_df[1,x],'---',my_df[2,x])))
colnames(my_df) = my_df[1,]
my_df = my_df[-c(1,2),]
my_df = melt(my_df, id.vars=c("rowid"))
my_df = separate(my_df,variable, c("Cat", "Col"), "---")
return(my_df)
}
read_my_xlsx('test.xlsx',1)
Output:
rowid Cat Col value
1 1 CATA A1 1
2 2 CATA A1 1
3 3 CATA A1 3
4 1 CATA A2 1
5 2 CATA A2 4
6 3 CATA A2 3
7 1 CATB B1 4
8 2 CATB B1 2
9 3 CATB B1 1
10 1 CATB B2 1
11 2 CATB B2 1
12 3 CATB B2 4
13 1 CATB B3 1
14 2 CATB B3 2
15 3 CATB B3 3
16 1 CATC C1 4
17 2 CATC C1 2
18 3 CATC C1 1
19 1 CATC C2 1
20 2 CATC C2 2
21 3 CATC C2 3
Upvotes: 4