Reputation: 55
I have a data frame with annual exports of firms to different countries in different years. I need to create a variable that says, for each year, how many firms there are in each country. I can do this perfectly well with a "tapply" command, like
incumbents <- tapply(id, destination-year, function(x) length(unique(x)))
and it works just fine. My problem is that incumbents has length length(destination-year), and I need it to have length length(id) (there are many firms each year serving each destination) so that I can use it in a subsequent regression, matched by year and destination of course. A "for" loop can do this, but it is very time-consuming since the database is huge.
Any suggestions?
Upvotes: 0
Views: 1298
Reputation: 176668
You don't provide a reproducible example, so I can't test this, but you should be able to use ave:
incumbents <- ave(id, destination-year, FUN=function(x) length(unique(x)))
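For what it's worth, here is a minimal sketch of the same idea on made-up toy data (the vectors below are hypothetical, not from the question). It assumes id is numeric and passes destination and year to ave as two separate grouping vectors, so no combined key is needed. The result has one entry per row of the original data.

```r
# Hypothetical toy data, only for illustration
id          <- c(1, 2, 2, 3, 1, 1)
destination <- c("A", "A", "A", "B", "B", "B")
year        <- c(2000, 2000, 2000, 2000, 2000, 2001)

# ave() applies FUN within each destination/year group and
# returns a vector as long as id, aligned with the original rows
incumbents <- ave(id, destination, year,
                  FUN = function(x) length(unique(x)))

length(incumbents) == length(id)  # TRUE
incumbents                        # 2 2 2 2 2 1
```

Since the output is already in row order, it can be added to the data frame as a new column directly.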
Upvotes: 1
Reputation: 173587
Using @JohnColby's excellent example data, I was thinking of something more along the lines of this:
#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, destinationYear)
require(plyr)  # ddply() comes from the plyr package
dat <- ddply(dat,.(destinationYear),transform,newCol = length(unique(id)))
#Or if more speed is required, use data.table
require(data.table)
datTable <- data.table(dat)
#':=' adds the column by reference within each group, avoiding a copy of .SD
datTable[, newCol := length(unique(id)), by = destinationYear]
Upvotes: 2
Reputation: 22588
Just "merge" the tapply summary back in with the original data frame with merge.
Since you didn't provide example data, I made some. Modify accordingly.
n = 1000
id = sample(1:10, n, replace=TRUE)
year = sample(2000:2011, n, replace=TRUE)
destination = sample(LETTERS[1:6], n, replace=TRUE)
`destination-year` = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, `destination-year`)
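Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data. (By default, data.frame() converts the non-syntactic name `destination-year` to destination.year in both data frames, so that is the column merge will match on.)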
incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)
Finally, merge back in with the original data:
merge(dat, incumbents)
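As a variation on the same merge idea, here is a rough sketch that keeps destination and year as two separate key columns and builds the summary with aggregate instead of tapply. The data and the column name incumbents are made up for illustration. Keep in mind that merge sorts the result by the key columns by default, so the row order will differ from the original data frame.

```r
# Hypothetical example data, only for illustration
set.seed(1)
n   <- 20
dat <- data.frame(id          = sample(1:5, n, replace = TRUE),
                  year        = sample(2000:2001, n, replace = TRUE),
                  destination = sample(c("A", "B"), n, replace = TRUE))

# One row per destination/year with the number of distinct firms
counts <- aggregate(id ~ destination + year, data = dat,
                    FUN = function(x) length(unique(x)))
names(counts)[names(counts) == "id"] <- "incumbents"

# Attach the group-level count to every firm-level row
merged <- merge(dat, counts, by = c("destination", "year"))

nrow(merged) == nrow(dat)  # every original row is kept
```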
By the way, instead of combining destination and year into a third variable, like it seems you've done, tapply can handle both variables directly as a list:
library(reshape2)  # melt() comes from reshape2 (or the older reshape package)
incumbents = melt(tapply(id, list(destination=destination, year=year), function(x) length(unique(x))))
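If you would rather skip the reshaping and merging step, one possible alternative (a sketch on hypothetical toy data, not something from the question) is to index the two-way tapply table with a two-column character matrix of destination and year. That looks up one cell per original row, so the result has length(id).

```r
# Hypothetical toy data, only for illustration
id          <- c(1, 2, 2, 3, 1, 1)
destination <- c("A", "A", "A", "B", "B", "B")
year        <- c(2000, 2000, 2000, 2000, 2000, 2001)

# destination x year table of distinct firm counts
counts <- tapply(id, list(destination, year),
                 function(x) length(unique(x)))

# Look up each row's cell by its dimnames (row = destination, column = year)
incumbents <- counts[cbind(as.character(destination), as.character(year))]

incumbents  # 2 2 2 2 2 1
```

This keeps the original row order, at the cost of building the full destination-by-year table first.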
Upvotes: 3