Reputation: 303
I've been trying to learn R for a while but haven't got my knowledge up to a decent level yet. I'll get there in the end, but I'm in a pinch at the moment and was wondering if you could help me with a quick "transformation" type piece of work.
I have a csv data file with 18 million rows and the following data fields: Person ID, Date and Value. It's basically output from a simulation model of the contributions a person makes into their savings account, e.g.:
1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51
So, as you can see, an ID can appear on multiple rows, but each date and contribution amount for a person is unique.
I would like to transform this so I have a contribution history by year for each person. So for example the above would become:
ID,2013,2014
1,40.00,68.46
2,16.68,28.43
3,20.34,38.13
I have a rough idea how I could approach the problem: create another column of data with just the years and then summarise by ID and year to add up all contributions that fit into each ID/year bucket. I just have no clue how to even begin translating that into an R script.
Any pointers/guidance would be most appreciated.
Many Thanks and Kind Regards.
Upvotes: 3
Views: 1897
Reputation: 60984
The approach you describe is a sound one. Converting the date string to a date and back can be done using strptime and strftime (and possibly as.POSIXct). Once you have the year column, you can use a number of tools available in R, e.g. data.table, by, or ddply. I like the syntax of the last one:
library(plyr)
ddply(df, .(ID, year), summarise, total_per_year = sum(value))
This assumes that your base data is in df, and that the columns in your data are called year, ID and value. Do note that for large datasets ddply can become quite slow. If you really need raw performance, you definitely want to start working with data.table.
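For what it's worth, here is a minimal end-to-end sketch of this approach (the file name myfile.csv and the column names date and value are placeholders for illustration, not taken from the question):

library(plyr)

# read the raw csv; the sample data in the question has no header row
df <- read.csv("myfile.csv", header = FALSE,
               col.names = c("ID", "date", "value"),
               stringsAsFactors = FALSE)

# derive the year from the dd/mm/yyyy date strings
df$year <- format(as.Date(df$date, format = "%d/%m/%Y"), "%Y")

# sum the contributions in each ID/year bucket
ddply(df, .(ID, year), summarise, total_per_year = sum(value))

The data.table equivalent would be along the lines of as.data.table(df)[, list(total_per_year = sum(value)), by = list(ID, year)] (after library(data.table)), though that is a sketch rather than a full answer.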
Upvotes: 3
Reputation: 270195
Here are a few possibilities:
zoo package

read.zoo in the zoo package can produce a multivariate time series with one column per series, i.e. one column per ID. We define yr to get the year from the index column and then split on the ID using the split= argument as we read the data in. We use aggregate = sum to aggregate over the remaining columns -- here just one. We use text = Lines to keep the code below self contained, but with a real file we would replace that with "myfile", say. The final line transposes the result; we could drop that line if it were OK to have persons in columns instead of rows.
Lines <- "1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51
"
library(zoo)
# given a Date string, x, output the year
yr <- function(x) floor(as.numeric(as.yearmon(x, "%d/%m/%Y")))
# read in data, reshape & aggregate
z <- read.zoo(text = Lines, sep = ",", index = 2, FUN = yr,
              aggregate = sum, split = 1)
# transpose (optional)
tz <- data.frame(ID = colnames(z), t(z), check.names = FALSE)
With the posted data we get the following result:
> tz
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
See ?read.zoo and also the zoo-read vignette.
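As a side note, with an actual file on disk (myfile.csv below is just a placeholder name) the same call would look roughly like this:

z <- read.zoo("myfile.csv", sep = ",", index = 2, FUN = yr,
              aggregate = sum, split = 1)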
reshape2 package

Here is a second solution using the reshape2 package:
library(reshape2)
# read in and fix up column names and Year
DF <- read.table(text = Lines, sep = ",") ##
colnames(DF) <- c("ID", "Year", "Value") ##
DF$Year <- sub(".*/", "", DF$Year) ##
dcast(DF, ID ~ Year, fun.aggregate = sum, value.var = "Value")
The result is:
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
reshape function

Here is a solution that does not use any add-on packages. First read in the data using the three lines marked ## in the last solution. This will produce DF. Then aggregate the data, reshape it from long to wide form and finally fix up the column names:
Ag <- aggregate(Value ~ ., DF, sum)                                     # sum Value for each ID/Year combination
res <- reshape(Ag, direction = "wide", idvar = "ID", timevar = "Year")  # long to wide, one column per Year
colnames(res) <- sub("Value.", "", colnames(res))                       # drop the "Value." prefix from the new columns
which produces this:
> res
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
tapply function

This solution does not use add-on packages either. Using Ag from the last solution, try this:
tapply(Ag$Value, Ag[1:2], sum)
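Note that, unlike the previous solutions, tapply here returns a matrix (with the IDs as row names) rather than a data frame; with the posted data the result looks roughly like this:

   Year
ID   2013  2014
  1 40.00 68.46
  2 16.68 28.43
  3 20.34 38.13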
UPDATES: minor improvements and 3 additional solutions.
Upvotes: 7