Reputation: 303
I've been trying to learn R for a while but haven't got my knowledge up to a decent level yet. I'll get there in the end, but I'm in a pinch at the moment and was wondering if you could help me with a quick "transformation" type piece of work.
I have a csv data file with 18 million rows and the following data fields: Person ID, Date and Value. It's basically output from a simulation model of the contributions a person makes into their savings account, e.g.:
1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51
So, as you can see, an ID can appear on multiple rows, but each date and contribution amount for a person is unique.
I would like to transform this so I have a contribution history by year for each person. So for example the above would become:
ID,2013,2014
1,40.00,68.46
2,16.68,28.43
3,20.34,38.13
I have a rough idea how I could approach the problem: create another column of data with just the years and then summarise by ID and year to add up all contributions that fit into each ID/year bucket. I just have no clue how to even begin translating that into an R script.
Any pointers/guidance would be most appreciated.
Many Thanks and Kind Regards.
Upvotes: 3
Views: 1897
Reputation: 60984
The approach you describe is a sound one. Converting the date string to a date and back can be done using strptime and strftime (and possibly as.POSIXct). Once you have the year column, you can use a number of tools available in R, e.g. data.table, by, or ddply. I like the syntax of the last one:
library(plyr)
ddply(df, .(ID, year), summarise, total_per_year = sum(value))
This assumes that your base data is in df, and that the columns in your data are called year, ID and value. Do note that for large datasets ddply can become quite slow. If you really need raw performance, you definitely want to start working with data.table.
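For what it's worth, here is a minimal end-to-end sketch of this approach (the file name myfile.csv and the column names date and value are placeholders for illustration, not taken from the question):

library(plyr)

# read the raw csv; the sample data in the question has no header row
df <- read.csv("myfile.csv", header = FALSE,
               col.names = c("ID", "date", "value"),
               stringsAsFactors = FALSE)

# derive the year from the dd/mm/yyyy date strings
df$year <- format(as.Date(df$date, format = "%d/%m/%Y"), "%Y")

# sum the contributions in each ID/year bucket
ddply(df, .(ID, year), summarise, total_per_year = sum(value))

The data.table equivalent would be along the lines of as.data.table(df)[, list(total_per_year = sum(value)), by = list(ID, year)] (after library(data.table)), though that is a sketch rather than a full answer.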
Upvotes: 3
Reputation: 270195
Here are a few possibilities:
zoo package

read.zoo in the zoo package can produce a multivariate time series with one column per series, i.e. one column per ID. We define yr to get the year from the index column and then split on the ID using the split= argument as we read the data in. We use aggregate = sum to aggregate over the remaining columns -- here just one. We use text = Lines to keep the code below self contained, but with a real file we would replace that with "myfile", say. The final line transposes the result; we could drop that line if it were OK to have persons in columns instead of rows.
Lines <- "1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51
"
library(zoo)
# given a Date string, x, output the year
yr <- function(x) floor(as.numeric(as.yearmon(x, "%d/%m/%Y")))
# read in data, reshape & aggregate
z <- read.zoo(text = Lines, sep = ",", index = 2, FUN = yr,
              aggregate = sum, split = 1)
# transpose (optional)
tz <- data.frame(ID = colnames(z), t(z), check.names = FALSE)
With the posted data we get the following result:
> tz
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
See ?read.zoo and also the zoo-read vignette.
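As a side note, with an actual file on disk (myfile.csv below is just a placeholder name) the same call would look roughly like this:

z <- read.zoo("myfile.csv", sep = ",", index = 2, FUN = yr,
              aggregate = sum, split = 1)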
reshape2 package

Here is a second solution using the reshape2 package:
library(reshape2)
# read in and fix up column names and Year
DF <- read.table(text = Lines, sep = ",") ##
colnames(DF) <- c("ID", "Year", "Value") ##
DF$Year <- sub(".*/", "", DF$Year) ##
dcast(DF, ID ~ Year, fun.aggregate = sum, value.var = "Value")
The result is:
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
reshape function

Here is a solution that does not use any add-on packages. First read in the data using the three lines marked ## in the last solution. This will produce DF. Then aggregate the data, reshape it from long to wide form and finally fix up the column names:
Ag <- aggregate(Value ~ ., DF, sum)                                     # sum Value for each ID/Year combination
res <- reshape(Ag, direction = "wide", idvar = "ID", timevar = "Year")  # long to wide, one column per Year
colnames(res) <- sub("Value.", "", colnames(res))                       # drop the "Value." prefix from the new columns
which produces this:
> res
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
tapply function

This solution does not use add-on packages either. Using Ag from the last solution, try this:
tapply(Ag$Value, Ag[1:2], sum)
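Note that, unlike the previous solutions, tapply here returns a matrix (with the IDs as row names) rather than a data frame; with the posted data the result looks roughly like this:

   Year
ID   2013  2014
  1 40.00 68.46
  2 16.68 28.43
  3 20.34 38.13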
UPDATES: minor improvements and 3 additional solutions.
Upvotes: 7