Reputation: 87
I have two large data sets and I am attempting to reformat the older data set to put the questions in the same order as the newer data set (so that I can easily perform t-tests on each identical question to track significant changes over the 2 years between data sets). The new version both deleted and added questions when changing from the old version.
The way I've been attempting to do this, R keeps crashing due to, as best I can figure, vectors being too large. I'm not sure how they are getting to be this large, however! Below is what I am doing:
Both data sets have the same format. The original sets are 415 for the new and 418 for the old. I want to match the first approximately 158 colums of the new data set to the old. Each data set has column names which are q1-q415 and the data in each column is numerical 1-5 or NA. There are approximately 100 answers per question/column, the old data set has more respondants (140 rows in old vs 114 rows in new). An example is below (but keep in mind there are over 400 columns in the full set and over 100 rows!)
The following is an example of what data.old looks like. data.new looks the same only data.new has more Rows of number/na answers. Here I show questions 1 through 20 and the first 10 rows. data.old = 418 columns (q1 though q418) x 140 rows data.new = 415 columns (q1 through q415) x 114 rows I need to match the first 170 COLUMNS of data.old to the first 157 COLUMNS of data.new To do this, I will be deleting 17 columns from data.old (questions that were in the data.old questionnaire and deleted from the data.new questionnaire) but also adding 7 new columns to data.old (which will contain NAs... place holders for where data.new had new questions introducted that did not exist in data.old questionnaire)
>data.old
q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20
1 3 4 3 3 5 4 1 NA 4 NA 1 2 NA 5 4 3 2 3 1
3 4 5 2 2 4 NA 1 3 2 5 2 NA 3 2 1 4 3 2 NA
2 NA 2 3 2 1 4 3 5 1 2 3 4 3 NA NA 2 1 2 5
1 2 4 1 2 5 2 3 2 1 3 NA NA 2 1 5 5 NA 2 3
4 3 NA 2 1 NA 3 4 2 2 1 4 5 5 NA 3 2 3 4 1
5 2 1 5 3 2 3 3 NA 2 1 5 4 3 4 5 3 NA 2 NA
NA 2 4 1 5 5 NA NA 2 NA 1 3 3 3 4 4 5 5 3 1
4 5 4 5 5 4 3 4 3 2 5 NA 2 NA 2 3 5 4 5 4
2 2 3 4 1 5 5 3 NA 2 1 3 5 4 NA 2 3 4 3 2
2 1 5 3 NA 2 3 NA 4 5 5 3 2 NA 2 3 1 3 2 4
So in the new set, some of the questions were deleted, some new ones were added, and some changed order, so I went through and created subsets of old data in the order that I would need to combine them again to match the new dataset. When a question does not exist in the old data set, I want to use the question in the new data set so that I can (theoretically) perform my t-tests in a big loop.
dataold.set1 <- dataold[1:16]
dataold.set2 <- dataold[18:19]
dataold.set3 <- dataold[21:23]
dataold.set4 <- dataold[25:26]
dataold.set5 <- dataold[30:33]
dataold.set6 <- dataold[35:36]
dataold.set7 <- dataold[38:39]
dataold.set8 <- dataold[41:42]
dataold.set9 <- dataold[44]
dataold.set10 <- dataold[46:47]
dataold.set11 <- dataold[49:54]
dataold.set12 <- datanew[43:49]
dataold.set13 <- dataold[62:85]
dataold.set14 <- dataold[87:90]
dataold.set15 <- datanew[78]
dataold.set16 <- dataold[91:142]
dataold.set17 <- dataold[149:161]
dataold.set18 <- dataold[55:61]
dataold.set19 <- dataold[163:170]
I then was attempting to put the columns back together into one set I tried both
dataold.adjust <- merge(dataold.set1, dataold.set2)
dataold.adjust <- merge(dataold.adjust, dataold.set3)
dataold.adjust <- merge(dataold.adjust, dataold.set4)
and I also tried
dataold.adjust <- cbind(dataold.set1, dataold.set2, dataold.set3)
However, every time I try to perform these functions, R freezes, then crashes. I managed to get it to display an error once, and it said it could not work with a vector of 10 Mb, and then I got multiple errors involving over 1000 Mb vectors. I'm not really sure how my vectors are that large, when this is crashing out by set 3, which is only 23 columns of data in a table, and the data sets I'm normally using are over 400 columns in length.
Is there another way to do this that won't cause my program to crash and have memory issues (and won't require me typing out the column names of over 100 columns), or is there some element of code here that I am missing where I'm getting a memory sink? I've been attempting to trouble shoot it and have spent an hour dealing with R crashing without any luck figuring out how to make this work.
Thanks for the assistance!
Upvotes: 1
Views: 1615
Reputation: 176648
You're making tons of unnecessary copies of your data and then you're growing the final object (dataold.adjust
). You just need a vector that orders the columns correctly:
cols1 <- c(1:16,18:19,21:23,25:26,30:33,35:36,38:39,41:42,44,46:47,49:54)
cols2 <- c(62:85,87:90)
cols3 <- c(91:142,149:161,55:61,163:170)
# merge old / new data by row and add NA for unmatched rows
dataold.adjust <- merge(data.old[,c(cols1,cols2,cols3)],
data.new[,c(43:49,78)], by="row.names", all=TRUE)
# put columns in desired order
dataold.adjust <- dataold.adjust[,c(1:length(cols1), # 1st cols from dataold
ncol(dataold.adjust)-length(43:49):1, # 1st cols from datanew
(length(cols1)+1):length(cols2), # 2nd cols from dataold
ncol(dataold.adjust), # 2nd cols from datanew
(length(cols1)+length(cols2)+1):length(cols3))] # 3rd cols from dataold
The last part is an absolute kludge, but I've hit my self-imposed time limit for SO today. :)
Upvotes: 5