bobfet1
bobfet1

Reputation: 1633

Windowing functions in R?

I am trying to do something in R that is equivalent to a "first value" windowing function in SQL, such as:

select *, first_value(value3) over (partition by value1, value2 order by value5 desc)
from df

Is there a good way to do this in R without having to create a database to use something like sqldf/RPostgreSQL?

I've seen some other examples like the answers here: Cumulative sum by group in sqldf?

but I'm having some trouble figuring out how to do it with a function where the order within the window matters. Another thing is that I'm working with ~500K rows, and so performance is an concern.

Edit: here's an example:

item_id  rental_date   customer_id
I001     10/20/2012    1
I002     10/05/2012    4 
I001     10/15/2012    3
I004     10/19/2012    1
I001     10/11/2012    6
I002     9/15/2012     5
I004     10/13/2012    10
I001     9/30/2012     4

How would I determine the first customer to rent a given item for each month?

Upvotes: 1

Views: 875

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269421

If by not using sqldf/PostgreSQL you mean using sqldf but using SQLite instead of PostgreSQL then try this (which relies on a new feature of SQLite added over the last year in which if there is a min or max then the other columns are guaranteed to be from the same row):

Lines <- "item_id  rental_date   customer_id
I001     10/20/2012    1
I002     10/05/2012    4 
I001     10/15/2012    3
I004     10/19/2012    1
I001     10/11/2012    6
I002     9/15/2012     5
I004     10/13/2012    10
I001     9/30/2012     4"

DF <- read.table(text = Lines, as.is = TRUE, header = TRUE)
DF$rental_date <- as.Date(DF$rental_date, "%m/%d/%Y")
DF$ym <- format(DF$rental_date, "%Y-%m")

sqldf("select item_id, ym, customer_id, min(rental_date) rental_date
    from DF 
    group by item_id, ym")

The result in this case is:

  item_id      ym customer_id      rental_date
1    I001 2012-09           4       2012-09-30
2    I001 2012-10           6       2012-10-11
3    I002 2012-09           5       2012-09-15
4    I002 2012-10           4       2012-10-05
5    I004 2012-10          10       2012-10-13

Upvotes: 4

Beasterfield
Beasterfield

Reputation: 7113

I assume that the object in your example is a data.frame which we call df.

library( "plyr" )
df$rental_date <- as.Date( df$rental_date, "%m/%d/%Y" )
df$year <-  as.numeric( format( df$rental_date, "%Y"))
df$month <- months( df$rental_date )

ddply( df, c("item_id", "month", "year"), function(x) {
  x[ min(x$rental_date) == x$rental_date, "customer_id", drop=FALSE ]
} )

The result should look like:

  item_id     month year customer_id
1    I001   October 2012           6
2    I001 September 2012           4
3    I002   October 2012           4
4    I002 September 2012           5
5    I004   October 2012          10

Upvotes: 2

Related Questions