Reputation: 69
I'm attempting to write an R script that will loop through about 100 small Excel spreadsheets and stack the 15-20 rows that have full cells (the exact number of rows varies in each small file) into one.
This code that I'm using (contained within the loop) reads the files in, but the columns that do not contain any entries are skipped.
newDf <- read.xlsx(filename,sheetName="Coding",header=FALSE,
rowIndex=rows,colIndex=columns)
My current workaround is to create a dummy first dataframe with values in every column; when I then rbind each newDf to it, the columns are forced to line up. But I'm hoping there's a better way.
Any help would be appreciated. Thank you.
Upvotes: 3
Views: 4539
Reputation: 34763
Maybe try using read.xlsx2
?
I saved an Excel file with this format and named it test.xlsx
:
id blank names
1: 1 John
2: 2 Jacob
3: 3 Jingel
4: 4 Heimer
5: 5 Schmidt
6: 6 Joe
7: 7 Public
8: 8 Jane
9: 9 Doe
10: 10 Ramsey
And another where I simply deleted the column header, which i named test2.xlsx
:
id names
1: 1 John
2: 2 Jacob
3: 3 Jingel
4: 4 Heimer
5: 5 Schmidt
6: 6 Joe
7: 7 Public
8: 8 Jane
9: 9 Doe
10: 10 Ramsey
When I read them in with read.xlsx
, blank
is skipped only when it's not named:
> read.xlsx("test.xlsx",sheetIndex=1)
id blank names
1 1 NA John
2 2 NA Jacob
3 3 NA Jingel
4 4 NA Heimer
5 5 NA Schmidt
6 6 NA Joe
7 7 NA Public
8 8 NA Jane
9 9 NA Doe
10 10 NA Ramsey
> read.xlsx("test2.xlsx",sheetIndex=1)
id names
1 1 John
2 2 Jacob
3 3 Jingel
4 4 Heimer
5 5 Schmidt
6 6 Joe
7 7 Public
8 8 Jane
9 9 Doe
10 10 Ramsey
However, if I use read.xlsx2
on test2.xlsx
:
> read.xlsx2("test.xlsx",sheetIndex=1)
id X. names
1 1 John
2 2 Jacob
3 3 Jingel
4 4 Heimer
5 5 Schmidt
6 6 Joe
7 7 Public
8 8 Jane
9 9 Doe
10 10 Ramsey
Now it's read in as a factor
column named X.
.
If I try and extend this to have a bunch of empty columns, here's what I get:
> read.xlsx2("test3.xlsx",sheetIndex=1)
id X. names X..1 names2 X..2 names3 X..3 names4
1 1 John John John John
2 2 Jacob Jacob Jacob Jacob
3 3 Jingel Jingel Jingel Jingel
4 4 Heimer Heimer Heimer Heimer
5 5 Schmidt Schmidt Schmidt Schmidt
6 6 Joe Joe Joe Joe
7 7 Public Public Public Public
8 8 Jane Jane Jane Jane
9 9 Doe Doe Doe Doe
10 10 Ramsey Ramsey Ramsey Ramsey
So the name creation by read.xlsx2
is predictable, as well. Also (I can't pin down where I heard this so I can't source it), I think it's good practice to use read.xlsx2
in general anyway--faster, etc.
I can also add this from ?read.xlsx
:
The read.xlsx2 function does more work in Java so it achieves better performance (an order of magnitude faster on sheets with 100,000 cells or more). The result of read.xlsx2 will in general be different from read.xlsx, because internally read.xlsx2 uses readColumns which is tailored for tabular data.
So @user1578653's suggestion to use the underlying readColumns
is covered by just using read.xlsx
.
Upvotes: 1
Reputation: 5038
I'm not sure why read.xlsx has this behaviour...
There are a couple of potential solutions. The first is to use the readColumns function in the XLSX package, instead of read.xlsx:
wb <- loadWorkbook("c:/path/to/your/xlsx/file")
sheets <- getSheets(wb)
sheet <- sheets[["NAMEOFSHEET"]]
newDf <- readColumns(sheet, startColumn=1, endColumn=3, startRow=1, endRow=4)
Alternatively, you can continue reading the files as you are, using read.xlsx, and just add blank columns in afterwards:
if("COLUMN_THAT_SHOULD_BE_THERE" %in% colnames(newDf) == FALSE){
newDf$COLUMN_THAT_SHOULD_BE_THERE <- NA
}
Upvotes: 2
Reputation: 18612
If you're willing to use the XLConnect
package rather than xlsx
, the function XLConnect::readWorksheet
allows you to specify the startCol
, endCol
, and autofitCol
arguments, which would solve this issue for you:
library(XLConnect)
##
# wb1 <- loadWorkbook("~/tmp/tmp1.xlsx", create = FALSE)
# wb2 <- loadWorkbook("~/tmp/tmp2.xlsx", create = FALSE)
df1 <- readWorksheet(
object = wb1,
sheet = "sheet1",
startCol = 1,
endCol = 3,
autofitCol = FALSE)
##
df2 <- readWorksheet(
object = wb2,
sheet = "sheet1",
startCol = 1,
endCol = 3,
autofitCol = FALSE)
##
R> head(df1,3)
A B Col3
1 1 6 NA
2 2 7 NA
3 3 8 NA
R> head(df2,3)
A B C
1 1 6 11
2 2 7 12
3 3 8 13
where df1
was read from a worksheet containing only two non-empty columns (see data below).
Data:
wb1 <- loadWorkbook("~/tmp/tmp1.xlsx", create = TRUE)
createSheet(wb1, "sheet1")
writeWorksheet(
object = wb1,
data = data.frame(
A = 1:5,
B = 6:10),
sheet = "sheet1")
saveWorkbook(wb1)
##
wb2 <- loadWorkbook("~/tmp/tmp2.xlsx", create = TRUE)
createSheet(wb2, "sheet1")
writeWorksheet(
object = wb2,
data = data.frame(
A = 1:5,
B = 6:10,
C = 11:15),
sheet = "sheet1")
saveWorkbook(wb2)
Upvotes: 2