Reputation: 3664
I would like to add a column to a data frame with the count of matches in another data frame, this seems pretty trivial but I can't seem to get it to work. Example:
smaller_df$CountOfMatches <- nrow(subset(larger_df, Date == smaller_df$Date))
This gives me the error:
In `==.default`(Date, smaller_df$Date) :
longer object length is not a multiple of shorter object length
I know the data frames are different lengths, I'm not asking for a merge, I simply need for that for every row/date (valid date object) in the smaller_df; a count how many matches in the larger_df.
I'm very new to R so there must be something fundamental and very trivial I'm missing here.
Thanks in advance
Upvotes: 1
Views: 1448
Reputation: 58825
It is easiest to create a summary table and then merge that in with your original (smaller) data. It's better to have a reproducible example. So here is some reproducible data:
smaller_df <- data.frame(Date=seq(as.Date("2000-01-01"),
as.Date("2000-01-10"), by="1 day"))
set.seed(5)
larger_df <- data.frame(Date=sample(seq(as.Date("2000-01-01"),
as.Date("2000-01-20"), by="1 day"),
80, replace=TRUE))
Create the table (counts) of dates in larger_df
tbl <- table(larger_df$Date)
Convert this to a data.frame suitable for merging
counts <- data.frame(Date=as.Date(names(tbl)), CountOfMatches=as.vector(tbl))
And then merge on date. Note that if a date doesn't appear in larger_df
but does in smaller_df
, then the CountOfMatches
will be NA
rather than 0
.
merge(smaller_df, counts, all.x=TRUE)
For this sample data, you get
> merge(smaller_df, counts, all.x=TRUE)
Date CountOfMatches
1 2000-01-01 4
2 2000-01-02 2
3 2000-01-03 5
4 2000-01-04 4
5 2000-01-05 5
6 2000-01-06 6
7 2000-01-07 2
8 2000-01-08 5
9 2000-01-09 3
10 2000-01-10 3
EDIT:
A more concise version which uses a package (which provides the convenience functions which get rid of some of the transformation details) is
library("plyr")
merge(smaller_df,
ddply(larger_df, .(Date), summarise, CountOfMatches=length(Date)),
all.x = TRUE)
Same result and, effectively, the same logic. Also the same caveat about dates that don't appear in larger_df
.
Upvotes: 4
Reputation: 263301
Here's what seems fairly straightforward:
smaller_df$bigDfCount <-sapply( smaller_df$Date,
FUN=function(x) length(larger_df[larger_df$Date==x, "Date"] ) )
smaller_df
Date bigDfCount
1 2000-01-01 4
2 2000-01-02 2
3 2000-01-03 5
4 2000-01-04 4
5 2000-01-05 5
6 2000-01-06 6
7 2000-01-07 2
8 2000-01-08 5
9 2000-01-09 3
10 2000-01-10 3
Upvotes: 4
Reputation: 81
There is a way to do this using the data.table package. This is a package for handling large datasets efficiently in memory, allowing SQL-like or SAS data step-like manipulation, but the square brackets [] behave differently from data.frame objects. You can put data.table joins, expressions and aggregation within []. Read the data.table manual to find out more.
First, convert your two frames to data.table objects, and set the key column to be Date. The data.table objects will be sorted by Date, and can then be joined.
Using the same sample data as above:
library(data.table)
smaller_df <- data.table(data.frame(Date=seq(as.Date("2000-01-01"),
as.Date("2000-01-10"), by="1 day")))
set.seed(5)
larger_df <- data.table(data.frame(Date=sample(seq(as.Date("2000-01-01"),
as.Date("2000-01-20"), by="1 day"), 80, replace=TRUE)))
Set the key column to be Date:
setkey(smaller_df, Date)
setkey(larger_df, Date)
You can use the by-without-by syntax and use the fact you have keyed by date. .N
will return the number of rows in the subset (ie the number of rows where the dates match).
larger_df[smaller_df, .N]
## Date N
## 1: 2000-01-01 4
## 2: 2000-01-02 2
## 3: 2000-01-03 5
## 4: 2000-01-04 4
## 5: 2000-01-05 5
## 6: 2000-01-06 6
## 7: 2000-01-07 2
## 8: 2000-01-08 5
## 9: 2000-01-09 3
## 10: 2000-01-10 3
Upvotes: 4