Reputation: 13
I really need your help regarding a problem which may seem easy to solve for you.
Currently I work on a project which involves some panel-regressions. I have several large csv-files (up to 12 million entries per sheet) which are formatted as in the picture attached, whereas the columns (V1, V2) are individuals and the rows (1, 2, 3) are time identifiers.
In order to use the plm()
-function I need all these files to convert to the following data structure:
ID Time X1 X2
1 1 x1 x2
1 2 x1 x2
1 ... ... ...
2 1 x1 x2
2 2 ... ...
I really struggle with this transformation and I'm really frustrated right now i.e. where do I get the identifier and the time index from? Would really appreciate if you could provide me with information how to solve this problem.
If my question is not clear to you, just ask.
Best regards and thanks in advance
The output should look like as follows:
Upvotes: 1
Views: 5111
Reputation: 193637
Here is an alternative: Use Stacked
from my "splitstackshape" package.
Here it is applied on @Metrics's sample data:
# install.packages("splitstackshape")
library(splitstackshape)
Stacked(cbind(id = 1:nrow(mydata), mydata),
id.vars="id", var.stubs="V", sep = "V")
# id .time_1 V
# 1: 1 1 10
# 2: 1 2 21
# 3: 1 3 31
# 4: 2 1 11
# 5: 2 2 22
# 6: 2 3 32
# 7: 3 1 12
# 8: 3 2 23
# 9: 3 3 3
# 10: 4 1 13
# 11: 4 2 24
# 12: 4 3 34
It would be very fast if your data are large. Here are the speeds for the 12MB dataset you linked to. The sorting is different but the data are the same.
It still isn't faster than stack
though (but at some point, stack
starts to slow down).
See the system.time
s below:
reshape()
system.time(out <- reshape(x, idvar = "time", ids = row.names(x),
times = names(x), timevar = "id",
varying = list(names(x)),
v.names="value",
new.row.names = 1:prod(dim(x)),
direction = "long"))
# user system elapsed
# 53.11 0.00 53.11
head(out)
# id value time
# 1 V1 0.003808635 1
# 2 V1 -0.018807416 2
# 3 V1 0.008875447 3
# 4 V1 0.001148695 4
# 5 V1 -0.019365004 5
# 6 V1 0.012436560 6
Stacked()
system.time(out2 <- Stacked(cbind(id = 1:nrow(x), x),
id.vars="id", var.stubs="V",
sep = "V"))
# user system elapsed
# 0.30 0.00 0.29
out2
# id .time_1 V
# 1: 1 1 0.003808635
# 2: 1 10 -0.014184635
# 3: 1 100 -0.013341843
# 4: 1 101 0.006784138
# 5: 1 102 0.006463707
# ---
# 963868: 2317 95 0.009569451
# 963869: 2317 96 0.002497771
# 963870: 2317 97 0.009202519
# 963871: 2317 98 0.017007545
# 963872: 2317 99 -0.002495842
stack()
system.time(out3 <- cbind(id = 1:nrow(x), stack(x)))
# user system elapsed
# 0.09 0.00 0.09
head(out3)
# id values ind
# 1 1 0.003808635 V1
# 2 2 -0.018807416 V1
# 3 3 0.008875447 V1
# 4 4 0.001148695 V1
# 5 5 -0.019365004 V1
# 6 6 0.012436560 V1
Upvotes: 1
Reputation: 15458
mydata<-structure(list(V1 = 10:13, V2 = 21:24, V3 = c(31L, 32L, 3L, 34L
)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-4L))
> mydata
V1 V2 V3
1 10 21 31
2 11 22 32
3 12 23 3
4 13 24 34
The following code can be used for your data without changing anything. For illustration, I used just the above data. I used the base R reshape
function
long <- reshape(mydata, idvar = "time", ids = row.names(mydata),
times = names(mydata), timevar = "id",
varying = list(names(mydata)),v.names="value", new.row.names = 1:((dim(mydata)[2])*(dim(mydata)[1])),direction = "long")
> long
id value time
1 V1 10 1
2 V1 11 2
3 V1 12 3
4 V1 13 4
5 V2 21 1
6 V2 22 2
7 V2 23 3
8 V2 24 4
9 V3 31 1
10 V3 32 2
11 V3 3 3
12 V3 34 4
long$id<-substr(long$id,2,4) # 4 is used to take into account your 416 variables
myout<-long[,c(1,3,2)]
> myout
id time value
1 1 1 10
2 1 2 11
3 1 3 12
4 1 4 13
5 2 1 21
6 2 2 22
7 2 3 23
8 2 4 24
9 3 1 31
10 3 2 32
11 3 3 3
12 3 4 34
Upvotes: 2