user2734306
user2734306

Reputation: 13

Data Transformation in R for Panel Regression

I really need your help regarding a problem which may seem easy to solve for you.

Currently I work on a project which involves some panel-regressions. I have several large csv-files (up to 12 million entries per sheet) which are formatted as in the picture attached, whereas the columns (V1, V2) are individuals and the rows (1, 2, 3) are time identifiers.

In order to use the plm()-function I need all these files to convert to the following data structure:

ID Time X1 X2
1 1 x1 x2
1 2 x1 x2
1 ... ... ...
2 1 x1 x2
2 2 ... ...

I really struggle with this transformation and I'm really frustrated right now i.e. where do I get the identifier and the time index from? Would really appreciate if you could provide me with information how to solve this problem.

If my question is not clear to you, just ask.

Best regards and thanks in advance

enter image description here

The output should look like as follows:

enter image description here

Upvotes: 1

Views: 5111

Answers (2)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193637

Here is an alternative: Use Stacked from my "splitstackshape" package.

Here it is applied on @Metrics's sample data:

# install.packages("splitstackshape")
library(splitstackshape)
Stacked(cbind(id = 1:nrow(mydata), mydata), 
        id.vars="id", var.stubs="V", sep = "V")
#     id .time_1  V
#  1:  1       1 10
#  2:  1       2 21
#  3:  1       3 31
#  4:  2       1 11
#  5:  2       2 22
#  6:  2       3 32
#  7:  3       1 12
#  8:  3       2 23
#  9:  3       3  3
# 10:  4       1 13
# 11:  4       2 24
# 12:  4       3 34

It would be very fast if your data are large. Here are the speeds for the 12MB dataset you linked to. The sorting is different but the data are the same.

It still isn't faster than stack though (but at some point, stack starts to slow down).

See the system.times below:

reshape()

system.time(out <- reshape(x, idvar = "time", ids = row.names(x),
                           times = names(x), timevar = "id",
                           varying = list(names(x)), 
                           v.names="value",
                           new.row.names = 1:prod(dim(x)), 
                           direction = "long"))
#    user  system elapsed 
#   53.11    0.00   53.11 
head(out)
#   id        value time
# 1 V1  0.003808635    1
# 2 V1 -0.018807416    2
# 3 V1  0.008875447    3
# 4 V1  0.001148695    4
# 5 V1 -0.019365004    5
# 6 V1  0.012436560    6

Stacked()

system.time(out2 <- Stacked(cbind(id = 1:nrow(x), x), 
                            id.vars="id", var.stubs="V", 
                            sep = "V"))
#    user  system elapsed 
#    0.30    0.00    0.29 

out2
#           id .time_1            V
#      1:    1       1  0.003808635
#      2:    1      10 -0.014184635
#      3:    1     100 -0.013341843
#      4:    1     101  0.006784138
#      5:    1     102  0.006463707
#     ---                          
# 963868: 2317      95  0.009569451
# 963869: 2317      96  0.002497771
# 963870: 2317      97  0.009202519
# 963871: 2317      98  0.017007545
# 963872: 2317      99 -0.002495842

stack()

system.time(out3 <- cbind(id = 1:nrow(x), stack(x)))
#    user  system elapsed 
#    0.09    0.00    0.09
head(out3)
#   id       values ind
# 1  1  0.003808635  V1
# 2  2 -0.018807416  V1
# 3  3  0.008875447  V1
# 4  4  0.001148695  V1
# 5  5 -0.019365004  V1
# 6  6  0.012436560  V1

Upvotes: 1

Metrics
Metrics

Reputation: 15458

 mydata<-structure(list(V1 = 10:13, V2 = 21:24, V3 = c(31L, 32L, 3L, 34L
    )), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, 
    -4L))

> mydata
  V1 V2 V3
1 10 21 31
2 11 22 32
3 12 23  3
4 13 24 34

The following code can be used for your data without changing anything. For illustration, I used just the above data. I used the base R reshape function

long <- reshape(mydata, idvar = "time", ids = row.names(mydata),
                times = names(mydata), timevar = "id",
                varying = list(names(mydata)),v.names="value", new.row.names = 1:((dim(mydata)[2])*(dim(mydata)[1])),direction = "long")

> long
   id value time
1  V1    10    1
2  V1    11    2
3  V1    12    3
4  V1    13    4
5  V2    21    1
6  V2    22    2
7  V2    23    3
8  V2    24    4
9  V3    31    1
10 V3    32    2
11 V3     3    3
12 V3    34    4
long$id<-substr(long$id,2,4) # 4 is used to take into account your 416 variables
myout<-long[,c(1,3,2)]
> myout
   id time value
1   1    1    10
2   1    2    11
3   1    3    12
4   1    4    13
5   2    1    21
6   2    2    22
7   2    3    23
8   2    4    24
9   3    1    31
10  3    2    32
11  3    3     3
12  3    4    34

Upvotes: 2

Related Questions