Anna
Anna

Reputation: 83

Comparing two dataframes based on a condition

I have following two data frames. df1 consists of an average size of a parameter for individual users. df2 contains the average daily value for the same parameter for users.

I want to count the number of days for the events where df2$size > df1$size for each user.

df1 = read.table(text='user  size
AAL0706 29000
AAN0823 25000
AAV0450 30000', stringsAsFactors=FALSE, header=TRUE)


df2 = read.table(text='Date    user    size
      2010-01-04 AAL0706 31054
      2010-01-06 AAL0706 20703
      2010-01-08 AAL0706 39968
      2010-01-04 AAN0823 17892
      2010-01-06 AAN0823 37839
      2010-01-08 AAN0823 19649
      2010-01-04 AAV0450 35432
      2010-01-06 AAV0450 37839', stringsAsFactors=FALSE, header=TRUE)

Expected output is:

   user  count
AAL0706      2
AAN0823      1
AAV0450      2

I have tried to use the following command to calculate my results, but I realise something is wrong.

lapply(df1, function(y) { 
    ddply(df2$size, .(user), function(x) { 
        return(length(y$size(y$size > x$size))
    })
})

Could you please advice me of an efficient way of doing this?

Upvotes: 2

Views: 120

Answers (2)

Arun
Arun

Reputation: 118889

A slightly more straightforward solution using data.table would be to use the new non-equi joins feature available in the current development version of data.table, v1.9.7.

require(data.table)
setDT(df2)[df1, .N, on=.(user, size > size), by=.EACHI]

Each row of df1 is matched against all rows of df2 based on the condition provided to the on argument, i.e., match for exact value of user and look for all rows where size of df2 is greater within that user.

Once the matching rows are obtained (for each row), the expression .N (= count of matching rows) is evaluated for each row, because by = .EACHI implies this. It instructs to execute the expression provided to the second argument j to run for each i (the first argument).

See installation instructions for devel version here.

Upvotes: 1

akrun
akrun

Reputation: 887971

We can do a left_join from dplyr, grouped by 'user' get the sum of logical index (size.x > size.y)

library(dplyr)
left_join(df2, df1, by = "user") %>% 
            group_by(user) %>% 
            summarise(Count = sum(size.x > size.y))
#       user Count
#     <chr> <int>
#1 AAL0706     2
#2 AAN0823     1
#3 AAV0450     2

Or using data.table

library(data.table)
setDT(df2)[df1, .(count = sum(size > i.size)),on = "user", by = .EACHI]
#      user count
#1: AAL0706     2
#2: AAN0823     1
#3: AAV0450     2

Upvotes: 1

Related Questions