Tyler Durden
Tyler Durden

Reputation: 303

Simple pivot table type transformation in R statistics

I've been trying learn R for a while but haven't got my knowledge up to even a decent level yet. I'll get there in the end, but I'm in a pinch at the moment and was wondering if you could help me do a quick "transformation" type piece.

I have a csv data file with 18 million rows with the following data fields: Person ID, Date and Value. It's basically from a simulation model and is simulating the contributions a person makes into their savings accounts, e.g.:

1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51

So, as you can see, there can be multiple IDs in the data but each date and contribution amount for a person is unique.

I would like to transform this so I have a contribution history by year for each person. So for example the above would become:

ID,2013,2014
1,40.00,68.46
2,16.68,28.43
3,20.34,38.13

I have a rough idea how I could approach the problem: create another column of data with just the years and then summarise by ID and year to add up all contributions that fit into each ID/year bucket. I just have no clue how to even begin translating that into an R script.

Any pointers/guidance would be most appreciated.

Many Thanks and Kind Regards.

Upvotes: 3

Views: 1897

Answers (2)

Paul Hiemstra
Paul Hiemstra

Reputation: 60984

The approach you describe is a sound one. Translating the date string back and forth from string to date and back can be done using strptime and strftime (possible as.POSIXct. Once you have the year column, you can use a number of tools available in R, e.g. data.table, by, or ddply. I like the syntax of the last one:

library(plyr)
ddply(df, .(ID, year), summarise, total_per_year = sum(value))

This assumes that your base date is in df, and that the columns in your data are called year, ID and value. Do note that for large datasets ddply can become quite slow. If you really need raw performance, you definitely want to start working with data.table.

Upvotes: 3

G. Grothendieck
G. Grothendieck

Reputation: 270195

Here are a few possibilities:

zoo package read.zoo in the zoo package can produce a multivariate time series one column per series, i.e. one column per ID. We define yr to get the year from the index column and then split on the ID using the split= argument as we read it in. We use aggregate=sum to aggregate over the remaining columns -- here just one. We use text = Lines to keep the code below self contained but with a real file we would replace that with "myfile", say. The final line transposes the result. We could drop that line if it were OK to have persons in columns instead of rows.

Lines <- "1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51
"
library(zoo)

# given a Date string, x, output the year
yr <- function(x) floor(as.numeric(as.yearmon(x, "%d/%m/%Y")))

# read in data, reshape & aggregate
z <- read.zoo(text = Lines, sep = ",", index = 2, FUN = yr,  
   aggregate = sum, split = 1)     

# transpose (optional)  
tz <- data.frame(ID = colnames(z), t(z), check.names = FALSE)

With the posted data we get the following result:

> tz
  ID  2013  2014
1  1 40.00 68.46
2  2 16.68 28.43
3  3 20.34 38.13

See ?read.zoo and also the zoo-read vignette.

reshape2 package Here is a second solution using the reshape2 package:

library(reshape2)

# read in and fix up column names and Year

DF <- read.table(text = Lines, sep = ",") ##
colnames(DF) <- c("ID", "Year", "Value") ##
DF$Year <- sub(".*/", "", DF$Year) ##

dcast(DF, ID ~ Year, fun.aggregate = sum, value.var = "Value") 

The result is:

  ID  2013  2014
1  1 40.00 68.46
2  2 16.68 28.43
3  3 20.34 38.13

reshape function Here is a solution that does not use any addon packages. First read in the data using the three lines marked ## in the last solution. This will produce DF. Then aggregate the data, reshape it from long to wide form and finally fix up the column names:

Ag <- aggregate(Value ~., DF, sum)
res <- reshape(Ag, direction = "wide", idvar = "ID", timevar = "Year")
colnames(res) <- sub("Value.", "", colnames(res))

which produces this:

> res
  ID  2013  2014
1  1 40.00 68.46
2  2 16.68 28.43
3  3 20.34 38.13

tapply function. This solution does not use addon packages either. Using Ag from the last solution try this:

tapply(Ag$Value, Ag[1:2], sum)

UPDATES: minor improvements and 3 additional solutions.

Upvotes: 7

Related Questions