Reputation: 1633
I am trying to do something in R that is equivalent to a "first value" windowing function in SQL, such as:
select *, first_value(value3) over (partition by value1, value2 order by value5 desc)
from df
Is there a good way to do this in R without having to create a database to use something like sqldf/RPostgreSQL?
I've seen some other examples like the answers here: Cumulative sum by group in sqldf?
but I'm having some trouble figuring out how to do it with a function where the order within the window matters. Another thing is that I'm working with ~500K rows, and so performance is an concern.
Edit: here's an example:
item_id rental_date customer_id
I001 10/20/2012 1
I002 10/05/2012 4
I001 10/15/2012 3
I004 10/19/2012 1
I001 10/11/2012 6
I002 9/15/2012 5
I004 10/13/2012 10
I001 9/30/2012 4
How would I determine the first customer to rent a given item for each month?
Upvotes: 1
Views: 875
Reputation: 269421
If by not using sqldf/PostgreSQL you mean using sqldf but using SQLite instead of PostgreSQL then try this (which relies on a new feature of SQLite added over the last year in which if there is a min or max then the other columns are guaranteed to be from the same row):
Lines <- "item_id rental_date customer_id
I001 10/20/2012 1
I002 10/05/2012 4
I001 10/15/2012 3
I004 10/19/2012 1
I001 10/11/2012 6
I002 9/15/2012 5
I004 10/13/2012 10
I001 9/30/2012 4"
DF <- read.table(text = Lines, as.is = TRUE, header = TRUE)
DF$rental_date <- as.Date(DF$rental_date, "%m/%d/%Y")
DF$ym <- format(DF$rental_date, "%Y-%m")
sqldf("select item_id, ym, customer_id, min(rental_date) rental_date
from DF
group by item_id, ym")
The result in this case is:
item_id ym customer_id rental_date
1 I001 2012-09 4 2012-09-30
2 I001 2012-10 6 2012-10-11
3 I002 2012-09 5 2012-09-15
4 I002 2012-10 4 2012-10-05
5 I004 2012-10 10 2012-10-13
Upvotes: 4
Reputation: 7113
I assume that the object in your example is a data.frame
which we call df
.
library( "plyr" )
df$rental_date <- as.Date( df$rental_date, "%m/%d/%Y" )
df$year <- as.numeric( format( df$rental_date, "%Y"))
df$month <- months( df$rental_date )
ddply( df, c("item_id", "month", "year"), function(x) {
x[ min(x$rental_date) == x$rental_date, "customer_id", drop=FALSE ]
} )
The result should look like:
item_id month year customer_id
1 I001 October 2012 6
2 I001 September 2012 4
3 I002 October 2012 4
4 I002 September 2012 5
5 I004 October 2012 10
Upvotes: 2