Reputation: 34733
So I've got data that looks like this:
id year principal interest
1: 011000600 2013 0.00 0.00
2: 011000600 2014 544.03 0.00
3: 011000700 2013 0.00 0.00
4: 011000700 2014 0.01 0.00
5: 011000800 2013 363.44 12.79
6: 011000800 2014 2005.98 0.00
7: 011000900 2013 0.00 0.00
8: 011000900 2014 0.00 0.00
9: 011001000 2013 0.00 0.00
10: 011001000 2014 0.00 0.00
11: 011001100 2013 0.00 0.00
12: 011001100 2014 1723.24 0.00
13: 011001560 2013 0.00 0.00
14: 011001560 2014 0.00 0.00
15: 011001650 2013 0.00 0.00
16: 011001650 2014 0.00 0.00
(basically a longitudinal sample of a bunch of variables)
The data is on the large side so I'm using data.table
for everything. I reshape it to get each id
unique by row:
datam<-melt(data,id=c("id","year"))
data1<-dcast.data.table(datam,id~...)
This yields:
id 2013_principal 2013_interest 2014_principal 2014_interest
1: 011000600 0.00 0.00 544.03 0
2: 011000700 0.00 0.00 0.01 0
3: 011000800 363.44 12.79 2005.98 0
4: 011000900 0.00 0.00 0.00 0
5: 011001000 0.00 0.00 0.00 0
6: 011001100 0.00 0.00 1723.24 0
This is course the form of data that I want, but having column names start with numbers is a pain in the keester.
Any suggestions for how to deal with this? I'd much rather have:
id principal_2013 interest_2013 principal_2014 interest_2014
1: 011000600 0.00 0.00 544.03 0
2: 011000700 0.00 0.00 0.01 0
3: 011000800 363.44 12.79 2005.98 0
4: 011000900 0.00 0.00 0.00 0
5: 011001000 0.00 0.00 0.00 0
6: 011001100 0.00 0.00 1723.24 0
(switching the year to be a suffix) I've tried being more explicit when casting, e.g.
data2<-dcast.data.table(datam,id~year+...)
data3<-dcast.data.table(datam,id~...+year)
To no avail:
data2
id 2013_principal 2013_interest 2014_principal 2014_interest
1: 011000600 0.00 0.00 544.03 0
2: 011000700 0.00 0.00 0.01 0
3: 011000800 363.44 12.79 2005.98 0
4: 011000900 0.00 0.00 0.00 0
5: 011001000 0.00 0.00 0.00 0
6: 011001100 0.00 0.00 1723.24 0
data3
id 2013_principal 2013_interest 2014_principal 2014_interest
1: 011000600 0.00 0.00 544.03 0
2: 011000700 0.00 0.00 0.01 0
3: 011000800 363.44 12.79 2005.98 0
4: 011000900 0.00 0.00 0.00 0
5: 011001000 0.00 0.00 0.00 0
6: 011001100 0.00 0.00 1723.24 0
Seems pretty silly for the naming convention of dcast to default to this style, given that I imagine this type of reshaping is ubiquitous.
I've also tried patching things up ex-post given some other posts I've found (e.g. here), but it runs unfathomably slow (there are ~400 variables to rename in the full data set)
names(data)<-ifelse(substr(names(data),1,2) %in% c("19","20"),
paste(substr(names(data),6,nchar(data)),
substr(names(data),1,4),sep="_") ,
names(copy))
(I'm trying to find all the variables starting with years--19xx or 20xx--and trying to swap the beginning and end)
Upvotes: 5
Views: 3645
Reputation: 118809
Even better, with the new developments to dcast
in data.table from v1.9.5+, we can cast multiple columns simultaneously..
require(data.table) # v1.9.5+
dcast(dt, id ~ year, value.var = c("principal", "interest"))
# id principal_2013 principal_2014 interest_2013 interest_2014
# 1: 11000600 0.00 544.03 0.00 0
# 2: 11000700 0.00 0.01 0.00 0
# 3: 11000800 363.44 2005.98 12.79 0
# 4: 11000900 0.00 0.00 0.00 0
# 5: 11001000 0.00 0.00 0.00 0
# 6: 11001100 0.00 1723.24 0.00 0
# 7: 11001560 0.00 0.00 0.00 0
# 8: 11001650 0.00 0.00 0.00 0
No more having to melt
unnecessarily before to cast
, therefore quite efficient.
Upvotes: 3
Reputation: 118809
o
dcast.data.table(dt, a ~ ... + b)
now generates the column names with values from 'b' coming last. Closes #5675.
That is, now you can do:
dcast.data.table(datam, id ~ ... + year)
# id principal_2013 principal_2014 interest_2013 interest_2014
# 1: 11000600 0.00 544.03 0.00 0
# 2: 11000700 0.00 0.01 0.00 0
# 3: 11000800 363.44 2005.98 12.79 0
# 4: 11000900 0.00 0.00 0.00 0
# 5: 11001000 0.00 0.00 0.00 0
# 6: 11001100 0.00 1723.24 0.00 0
# 7: 11001560 0.00 0.00 0.00 0
# 8: 11001650 0.00 0.00 0.00 0
and the column names will have year
values at the end, as expected.
o ?dcast.data.table now explains how the names are generated for the columns that are being casted. Closes #5676.
Now ?dcast.data.table
contains the added line:
Names for columns that are being cast are generated in the same order (separated by a _) from the (unique) values in each column mentioned in the formula RHS.
HTH
Upvotes: 3
Reputation: 206308
@Arun's solution is to explicitly set the right hand of the formula to the order that you wish to name the columns.
data1 <- dcast.data.table(datam,id ~ variable+year, value.var="value")
Upvotes: 1