Reputation: 38619
I have an Excel file with a sheet for each week in my data set. Each sheet has the same number of rows, and each row is identical across the sheets (with the exception of the time period… sheet 1 represents week 1, sheet 2 week 2, etc.). I'm trying to import all the Excel sheets as one data frame in R.
For example, my data is essentially structured like this (with several more columns and sheets):
Week 1 sheet
ID  Gender  DOB   Absences  Lates  Absences_excused
1   M       1997  5         14     5
2   F       1998  4         3      2
Week 2 sheet
ID  Gender  DOB   Absences  Lates  Absences_excused
1   M       1997  2         10     3
2   F       1998  8         2      9
I'm trying to build a script that will take any number of sheets and combine them into one data frame, like this:
Combined (ideal)
ID  Gender  DOB   Absences.1  Lates.1  Absences.2  Lates.2
1   M       1997  5           14       2           10
2   F       1998  4           3        8           2
I'm using gdata to import the Excel files.
I've tried creating a loop (normally bad for R, I know...) that will go through all the sheets in the Excel file and add each to a list as a data frame:
library(gdata)
number_sheets <- 3
all.sheets <- vector(mode="list", length=number_sheets)
for (i in 1:number_sheets) {
  all.sheets[[i]] <- read.xls("/path/to/file.xlsx", sheet=i)
}
This gives me a nice list, all.sheets, that I can access, but I'm unsure about the best way to create a new data frame from specific columns in the list of data frames.
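Side note: I'm hardcoding number_sheets above; if I'm reading the gdata docs right, sheetCount() can report it instead, though I haven't tested this:
number_sheets <- sheetCount("/path/to/file.xlsx")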
I've tried the code below, which creates a brand new data frame by looping through the list of data frames. On the first pass it saves the columns that are consistent across all the sheets, and then it adds the week-specific columns.
Cleaned <- data.frame()
number_sheets <- 3
for (i in 1:number_sheets) {
  if (i == 1) {
    Cleaned <- all.sheets[[i]][, c("ID", "Gender", "DOB")]
  }
  Cleaned$Absences.i <- all.sheets[[i]][, c("Absences")] # wrong... obviously doesn't work... but essentially what I want
  # Other week-specific columns go here... somehow...
}
This code doesn't work, though, since Cleaned$Absences.i is obviously not how you create dynamic columns in a data frame.
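From what I've read, the column name can be built with paste and assigned with [[, so I suspect something like this untested sketch is closer to what I need (it leans on the rows being identical across sheets, which they are in my data):
for (i in 1:number_sheets) {
  if (i == 1) {
    Cleaned <- all.sheets[[i]][, c("ID", "Gender", "DOB")]
  }
  # Build each week-specific column name dynamically, e.g. "Absences.1", "Lates.2"
  Cleaned[[paste("Absences", i, sep=".")]] <- all.sheets[[i]][, "Absences"]
  Cleaned[[paste("Lates", i, sep=".")]] <- all.sheets[[i]][, "Lates"]
}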
What's the best way to combine a set of data frames and create new columns for each of the variables I'm trying to track?
Extra hurdle: I'm also trying to combine two columns, "Absences" and "Absences_excused", into a single "Absences" column in the final data frame, so I need the solution to let me perform transformations on the new columns, like so (again, this isn't right):
Cleaned$Absences.i <- all.sheets[[i]][,c("Absences")] + all.sheets[[i]][,c("Absences_excused")]
Upvotes: 3
Views: 3943
Reputation: 263301
The merge strategy is:
> Week_1_sheet <- read.table(text="ID Gender DOB Absences Lates
+ 1 M 1997 5 14
+ 2 F 1998 4 3", header=TRUE)
> Week_2_sheet <- read.table(text="ID Gender DOB Absences Lates
+ 1 M 1997 2 10
+ 2 F 1998 8 2", header=TRUE)
> merge(Week_1_sheet, Week_2_sheet, 1:3)
ID Gender DOB Absences.x Lates.x Absences.y Lates.y
1 1 M 1997 5 14 2 10
2 2 F 1998 4 3 8 2
You can rename the columns with names(sheet) <- sub("x", "1", names(sheet)), and again for y -> 2. I think the cbind strategy is OK, but merge is probably better to learn.
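For example, a sketch of that renaming applied to the merged result:
combined <- merge(Week_1_sheet, Week_2_sheet, by=1:3)
names(combined) <- sub("\\.x$", ".1", names(combined))  # Absences.x -> Absences.1, etc.
names(combined) <- sub("\\.y$", ".2", names(combined))  # Absences.y -> Absences.2, etc.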
@TylerRinker raises the question of acceptable arguments to the 'by' parameter. The relevant sentence in the help page is: "Columns can be specified by name, number or by a logical vector: the name "row.names" or the number 0 specifies the row names."
Upvotes: 7
Reputation: 109844
@DWin I think the poster's problem is a little more complex than the example leads us to believe: they want a multi-way merge, as indicated by "sheet 1 represents week 1, sheet 2 week 2, etc.". My approach is a bit different. The extra hurdle can be taken care of before the merge, using lapply with transform. Here's my solution for the merge, using three data frames instead of two.
#First read in three data frames
Week_1_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates
1 1 M 1997 5 1 14
2 2 F 1998 4 2 3", header=TRUE)
Week_2_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates
1 1 M 1997 2 1 10
2 2 F 1998 8 2 2
3 3 M 1998 8 2 2", header=TRUE)
Week_3_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates
1 1 M 1997 2 1 10
2 2 F 1998 8 2 2", header=TRUE)
#Put them into a list structure
WEEKlist <- list(Week_1_sheet , Week_2_sheet , Week_3_sheet)
# Add Absences and Unexcused_Absences row-wise, then drop the
# Unexcused_Absences column (column 5)
lapply(seq_along(WEEKlist), function(x) {
    WEEKlist[[x]] <<- transform(WEEKlist[[x]],
        Absences = Absences + Unexcused_Absences)[, -5]
})
# Append the week number to each sheet's week-specific columns; `<<-` assigns
# back into WEEKlist in the enclosing environment
lapply(seq_along(WEEKlist), function(x) {
    y <- names(WEEKlist[[x]])
    names(WEEKlist[[x]]) <<- c(y[1:3], paste(y[4:length(y)], ".", x, sep=""))
})
# Loop through and merge on the common columns
DF <- WEEKlist[[1]][, 1:3]
for (.df in WEEKlist) {
    DF <- merge(DF, .df, by=c('ID', 'Gender', 'DOB'), all=TRUE, suffixes=c("", ""))
}
DF
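Note that all=TRUE keeps ID 3 (present only in the week-2 sheet) in the result, with NA in its week-1 and week-3 columns, and the empty suffixes are safe here because the week number was already appended to each sheet's measure columns.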
A second approach (after renaming the data frame columns) is to use Reduce, taken from (LINK):
merge.all <- function(frames, by) {
    Reduce(function(x, y) merge(x, y, by = by, all = TRUE), frames)
}
merge.all(frames=WEEKlist, by=c('ID', 'Gender', 'DOB'))
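Reduce folds the list pairwise, i.e. merge(merge(Week_1_sheet, Week_2_sheet), Week_3_sheet), so it generalizes to any number of sheets without an explicit loop.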
I'm not sure which one is faster, though.
EDIT: On a Windows 7 machine, running 1000 replications, the Reduce approach was faster:
    test replications elapsed relative user.self sys.self
1   LOOP         1000   10.12  1.62701      7.89        0
2 REDUCE         1000    6.22  1.00000      5.34        0
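A sketch of the kind of call that produces a table like this, using the rbenchmark package (the columns match its benchmark() output):
library(rbenchmark)
benchmark(
    LOOP = {
        DF <- WEEKlist[[1]][, 1:3]
        for (.df in WEEKlist) {
            DF <- merge(DF, .df, by=c('ID', 'Gender', 'DOB'), all=TRUE, suffixes=c("", ""))
        }
    },
    REDUCE = merge.all(frames=WEEKlist, by=c('ID', 'Gender', 'DOB')),
    replications=1000,
    columns=c("test", "replications", "elapsed", "relative", "user.self", "sys.self")
)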
Upvotes: 4