gatapia
gatapia

Reputation: 3664

Adding a count column in one data frame with the count of matches in another data frame

I would like to add a column to a data frame with the count of matches in another data frame, this seems pretty trivial but I can't seem to get it to work. Example:

smaller_df$CountOfMatches <- nrow(subset(larger_df, Date == smaller_df$Date))

This gives me the error:

In `==.default`(Date, smaller_df$Date) :
  longer object length is not a multiple of shorter object length

I know the data frames are different lengths, I'm not asking for a merge, I simply need for that for every row/date (valid date object) in the smaller_df; a count how many matches in the larger_df.

I'm very new to R so there must be something fundamental and very trivial I'm missing here.

Thanks in advance

Upvotes: 1

Views: 1448

Answers (3)

Brian Diggs
Brian Diggs

Reputation: 58825

It is easiest to create a summary table and then merge that in with your original (smaller) data. It's better to have a reproducible example. So here is some reproducible data:

smaller_df <- data.frame(Date=seq(as.Date("2000-01-01"), 
                                  as.Date("2000-01-10"), by="1 day"))
set.seed(5)
larger_df <- data.frame(Date=sample(seq(as.Date("2000-01-01"), 
                                        as.Date("2000-01-20"), by="1 day"),
                                    80, replace=TRUE))

Create the table (counts) of dates in larger_df

tbl <- table(larger_df$Date)

Convert this to a data.frame suitable for merging

counts <- data.frame(Date=as.Date(names(tbl)), CountOfMatches=as.vector(tbl))

And then merge on date. Note that if a date doesn't appear in larger_df but does in smaller_df, then the CountOfMatches will be NA rather than 0.

merge(smaller_df, counts, all.x=TRUE)

For this sample data, you get

> merge(smaller_df, counts, all.x=TRUE)
         Date CountOfMatches
1  2000-01-01              4
2  2000-01-02              2
3  2000-01-03              5
4  2000-01-04              4
5  2000-01-05              5
6  2000-01-06              6
7  2000-01-07              2
8  2000-01-08              5
9  2000-01-09              3
10 2000-01-10              3

EDIT:

A more concise version which uses a package (which provides the convenience functions which get rid of some of the transformation details) is

library("plyr")
merge(smaller_df, 
      ddply(larger_df, .(Date), summarise, CountOfMatches=length(Date)),
      all.x = TRUE)

Same result and, effectively, the same logic. Also the same caveat about dates that don't appear in larger_df.

Upvotes: 4

IRTFM
IRTFM

Reputation: 263301

Here's what seems fairly straightforward:

smaller_df$bigDfCount <-sapply( smaller_df$Date,  
                        FUN=function(x) length(larger_df[larger_df$Date==x, "Date"] ) )
smaller_df

         Date bigDfCount
1  2000-01-01          4
2  2000-01-02          2
3  2000-01-03          5
4  2000-01-04          4
5  2000-01-05          5
6  2000-01-06          6
7  2000-01-07          2
8  2000-01-08          5
9  2000-01-09          3
10 2000-01-10          3

Upvotes: 4

anoop
anoop

Reputation: 81

There is a way to do this using the data.table package. This is a package for handling large datasets efficiently in memory, allowing SQL-like or SAS data step-like manipulation, but the square brackets [] behave differently from data.frame objects. You can put data.table joins, expressions and aggregation within []. Read the data.table manual to find out more.

First, convert your two frames to data.table objects, and set the key column to be Date. The data.table objects will be sorted by Date, and can then be joined.

Using the same sample data as above:

library(data.table)
smaller_df <- data.table(data.frame(Date=seq(as.Date("2000-01-01"), 
    as.Date("2000-01-10"), by="1 day")))
set.seed(5)
larger_df <- data.table(data.frame(Date=sample(seq(as.Date("2000-01-01"), 
    as.Date("2000-01-20"), by="1 day"), 80, replace=TRUE)))

Set the key column to be Date:

setkey(smaller_df, Date)
setkey(larger_df, Date)

You can use the by-without-by syntax and use the fact you have keyed by date. .N will return the number of rows in the subset (ie the number of rows where the dates match).

larger_df[smaller_df, .N]
##         Date   N
##  1: 2000-01-01 4
##  2: 2000-01-02 2
##  3: 2000-01-03 5
##  4: 2000-01-04 4
##  5: 2000-01-05 5
##  6: 2000-01-06 6
##  7: 2000-01-07 2
##  8: 2000-01-08 5
##  9: 2000-01-09 3
## 10: 2000-01-10 3

Upvotes: 4

Related Questions