Vinterwoo
Vinterwoo

Reputation: 3941

Splitting a data frame to create new columns

I have a data frame with columns for "Count","Transect Number","Data", and "Year". My goal is to split up the data frame by Transect, then again by Year, and create a new data frame with a column for "Transect", and then the appropriate data per Year in the following columns.

To build a dummy data frame:

Count1<-1:27
Count2<-1:30
Count3<-1:25
T1<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
T2<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,2,3,3,3,3)
T3<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3)
Data1<-c(1,2,3,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2,3,4,5,4,3,2,3,3,2)
Data2<-c(1,2,3,2,1,4,3,2,1,2,4,3,2,3,4,3,2,3,4,5,6,4,3,2,1,4,5,4,3,2)
Data3<-c(1,2,3,4,5,4,3,3,3,4,5,4,3,3,2,3,4,5,4,3,4,3,2,3,4)
Year1<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year2<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year3<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016)




DF1<-data.frame(Count1,T1,Data1,Year1)
colnames(DF1)<-c("Count","Transect","Data","Year")
DF2<-data.frame(Count2,T2,Data2,Year2)
colnames(DF2)<-c("Count","Transect","Data","Year")
DF3<-data.frame(Count3,T3,Data3,Year3)
colnames(DF3)<-c("Count","Transect","Data","Year")


All<-rbind(DF1,DF2,DF3)

Once I have the data frame, my thought was to split up the data by transect since this will be a permanent aspect of my ongoing data set.

#Step 1-Break down by T
Trans1<-All[All$Transect==1,]
Trans2<-All[All$Transect==2,]
Trans3<-All[All$Transect==3,]
Trans4<-All[All$Transect==4,]
Trans5<-All[All$Transect==5,]

But I'm a little less clear on the next step. I need to pull out data from the "Data" column organized by year. Something along the lines of further breaking down the data like so:

Trans1_Year1<-Trans1[Trans1$Year==2014,]
Trans2_Year1<-Trans2[Trans2$Year==2014,]
Trans3_Year1<-Trans3[Trans3$Year==2014,]
Trans4_Year1<-Trans4[Trans4$Year==2014,]
Trans5_Year1<-Trans5[Trans5$Year==2014,]

or even using split

ByYear1<-split(Trans1,Trans1$Year)

But I would prefer to avoid writing out the code as above as I hope to add new data every year as this data set progresses. And I'd like the code to be able to accommodate new "Year" data as it is added, as opposed to writing out new lines of code every year.

Once I have the data set up like so, I'd like to create a second data frame with columns for each year. One problem is that the each year contains differing numbers of rows, which has been an issue for me. But my final result would have columns:

 "Transect", "Data 2014", "Data 2015", "Data 2016"

Since each year has can have different numbers of rows within a transect, I'd like to leave NA's at the end of each Transect section when the number of rows per individual transect differ between years.

Upvotes: 3

Views: 161

Answers (2)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

It sounds like you are basically trying to convert your data into a semi-wide format, with columns for years, rather than keeping it in the "long" format.

If this is the case, you're better off adding a secondary index column that shows the repeated combination of "Transect" and "Year".

This can easily be done with getanID from my "splitstackshape" package. "splitstackshape" also loads "data.table", from which you could then use dcast.data.table to get a wide format.

library(splitstackshape)
dcast.data.table(getanID(All, c("Transect", "Year")), 
                 Transect + .id ~ Year, value.var = "Data")
#     Transect .id 2014 2015 2016
#  1:        1   1    1    2    3
#  2:        1   2    2    1    4
#  3:        1   3    3    2    5
#  4:        1   4    1    2    4
#  5:        1   5    2    4    5
#  6:        1   6    3    3    6
#  7:        1   7    1    4    4
#  8:        1   8    2    5    4
#  9:        1   9    3    4    3
# 10:        1  10   NA   NA    4
# 11:        2   1    2    3    4
# 12:        2   2    1    4    3
# 13:        2   3    2    3    2
# 14:        2   4    2    2    3
# 15:        2   5    1    3    2
# 16:        2   6    4    4    1
# 17:        2   7    4    3    4
# 18:        2   8    5    3    3
# 19:        2   9    4    2    2
# 20:        2  10   NA   NA    3
# 21:        3   1    3    2    3
# 22:        3   2    4    1    3
# 23:        3   3    3    2    2
# 24:        3   4    3    3    5
# 25:        3   5    2    2    4
# 26:        3   6    1    3    3
# 27:        3   7    3    3    2
# 28:        3   8    3    4    4
# 29:        3   9    3    5   NA
#     Transect .id 2014 2015 2016

Then, if you really want to split on the "Transect" column you can go ahead and use split, but since you now have a "data.table" it would be better to stick with that and take advantage of its many convenient features, including those related to subsetting and aggregation.

Upvotes: 3

Mike Wise
Mike Wise

Reputation: 22847

I think you are forcing your data into a format it does not have naturally. There are a lot of processing advantages to leaving it in "long" format. Have a look at this article if you have not seen it yet, it is a classic.

http://www.jstatsoft.org/v21/i12

Upvotes: 0

Related Questions