Splitting a data frame to create new columns

Question

I have a data frame with columns for "Count","Transect Number","Data", and "Year". My goal is to split up the data frame by Transect, then again by Year, and create a new data frame with a column for "Transect", and then the appropriate data per Year in the following columns.

To build a dummy data frame:

Count1<-1:27
Count2<-1:30
Count3<-1:25
T1<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
T2<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,2,3,3,3,3)
T3<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3)
Data1<-c(1,2,3,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2,3,4,5,4,3,2,3,3,2)
Data2<-c(1,2,3,2,1,4,3,2,1,2,4,3,2,3,4,3,2,3,4,5,6,4,3,2,1,4,5,4,3,2)
Data3<-c(1,2,3,4,5,4,3,3,3,4,5,4,3,3,2,3,4,5,4,3,4,3,2,3,4)
Year1<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year2<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year3<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016)




DF1<-data.frame(Count1,T1,Data1,Year1)
colnames(DF1)<-c("Count","Transect","Data","Year")
DF2<-data.frame(Count2,T2,Data2,Year2)
colnames(DF2)<-c("Count","Transect","Data","Year")
DF3<-data.frame(Count3,T3,Data3,Year3)
colnames(DF3)<-c("Count","Transect","Data","Year")


All<-rbind(DF1,DF2,DF3)

Once I have the data frame, my thought was to split up the data by transect since this will be a permanent aspect of my ongoing data set.

#Step 1-Break down by T
Trans1<-All[All$Transect==1,]
Trans2<-All[All$Transect==2,]
Trans3<-All[All$Transect==3,]
Trans4<-All[All$Transect==4,]
Trans5<-All[All$Transect==5,]

But I'm a little less clear on the next step. I need to pull out data from the "Data" column organized by year. Something along the lines of further breaking down the data like so:

Trans1_Year1<-Trans1[Trans1$Year==2014,]
Trans2_Year1<-Trans2[Trans2$Year==2014,]
Trans3_Year1<-Trans3[Trans3$Year==2014,]
Trans4_Year1<-Trans4[Trans4$Year==2014,]
Trans5_Year1<-Trans5[Trans5$Year==2014,]

or even using split

ByYear1<-split(Trans1,Trans1$Year)

But I would prefer to avoid writing out the code as above as I hope to add new data every year as this data set progresses. And I'd like the code to be able to accommodate new "Year" data as it is added, as opposed to writing out new lines of code every year.

Once I have the data set up like so, I'd like to create a second data frame with columns for each year. One problem is that the each year contains differing numbers of rows, which has been an issue for me. But my final result would have columns:

 "Transect", "Data 2014", "Data 2015", "Data 2016"

Since each year has can have different numbers of rows within a transect, I'd like to leave NA's at the end of each Transect section when the number of rows per individual transect differ between years.

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer

It sounds like you are basically trying to convert your data into a semi-wide format, with columns for years, rather than keeping it in the "long" format.

If this is the case, you're better off adding a secondary index column that shows the repeated combination of "Transect" and "Year".

This can easily be done with getanID from my "splitstackshape" package. "splitstackshape" also loads "data.table", from which you could then use dcast.data.table to get a wide format.

library(splitstackshape)
dcast.data.table(getanID(All, c("Transect", "Year")), 
                 Transect + .id ~ Year, value.var = "Data")
#     Transect .id 2014 2015 2016
#  1:        1   1    1    2    3
#  2:        1   2    2    1    4
#  3:        1   3    3    2    5
#  4:        1   4    1    2    4
#  5:        1   5    2    4    5
#  6:        1   6    3    3    6
#  7:        1   7    1    4    4
#  8:        1   8    2    5    4
#  9:        1   9    3    4    3
# 10:        1  10   NA   NA    4
# 11:        2   1    2    3    4
# 12:        2   2    1    4    3
# 13:        2   3    2    3    2
# 14:        2   4    2    2    3
# 15:        2   5    1    3    2
# 16:        2   6    4    4    1
# 17:        2   7    4    3    4
# 18:        2   8    5    3    3
# 19:        2   9    4    2    2
# 20:        2  10   NA   NA    3
# 21:        3   1    3    2    3
# 22:        3   2    4    1    3
# 23:        3   3    3    2    2
# 24:        3   4    3    3    5
# 25:        3   5    2    2    4
# 26:        3   6    1    3    3
# 27:        3   7    3    3    2
# 28:        3   8    3    4    4
# 29:        3   9    3    5   NA
#     Transect .id 2014 2015 2016

Then, if you really want to split on the "Transect" column you can go ahead and use split, but since you now have a "data.table" it would be better to stick with that and take advantage of its many convenient features, including those related to subsetting and aggregation.

Splitting a data frame to create new columns

Answers (2)

Related Questions