Reputation: 201
I'm sure this question has been posed before, but I'd like some input on my specific case. In return for your help, I'll use an interesting example.
Sean Lahman provides giant datasets of MLB baseball statistics, available free on his website (http://www.seanlahman.com/baseball-archive/statistics/).
I'd like to use this data to answer the following question: What is the average number of home runs per game recorded for each decade in the MLB?
Below I've pasted all relevant script:
teamdata = read.csv("Teams.csv", header = TRUE)
decades = c(1870, 1880, 1890, 1900, 1910, 1920, 1930, 1940,
            1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020)
meanhomers = c()
for (i in 1:length(decades)) {
  # mean HR per team-season for years in [decades[i], decades[i+1]);
  # the final iteration compares against NA, hence the trailing NaN below
  meanhomers[i] = mean(teamdata$HR[teamdata$yearID >= decades[i] &
                                   teamdata$yearID < decades[i + 1]])
}
My primary question is, how could this answer have been determined without resorting to the dreaded for-loop?
Side question: What simple script would have generated the decades vector for me?
(For those interested in the answer to the baseball question, see below.)
meanhomers
[1] 4.641026 23.735849 34.456522 20.421053 25.755682 61.837500 84.012500
[8] 80.987500 130.375000 132.166667 120.093496 126.700000 148.737410 173.826667
[15] 152.973333 NaN
Edit for clarity: It turns out I answered the wrong question; the answer provided above gives the number of home runs per team per year, not per game. A little fix of the denominator would get the correct result.
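A minimal sketch of that fix, using the games column G from Teams.csv: total the home runs and the games within each decade, then divide.
decade <- 10 * (teamdata$yearID %/% 10)  # decade each team-season belongs to
hr_per_game <- tapply(teamdata$HR, decade, sum) / tapply(teamdata$G, decade, sum)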
Upvotes: 4
Views: 233
Reputation: 42689
aggregate is handy for this sort of thing. You can use your decades object with findInterval to put the years into bins:
aggregate(HR ~ findInterval(yearID, decades), data=teamdata, FUN=mean)
## findInterval(yearID, decades) HR
## 1 1 4.641026
## 2 2 23.735849
## 3 3 34.456522
## 4 4 20.421053
## 5 5 25.755682
## 6 6 61.837500
## 7 7 84.012500
## 8 8 80.987500
## 9 9 130.375000
## 10 10 132.166667
## 11 11 120.093496
## 12 12 126.700000
## 13 13 148.737410
## 14 14 173.826667
## 15 15 152.973333
Note that the intervals used are left-closed, as you desire. Also note that the intervals need not be regular. Yours are, which leads to the "side question" of how to produce the decades vector: don't even compute it. Instead, directly compute which decade each year falls in:
aggregate(HR ~ I(10 * (yearID %/% 10)), data=teamdata, FUN=mean)
## I(10 * (yearID%/%10)) HR
## 1 1870 4.641026
## 2 1880 23.735849
## 3 1890 34.456522
## 4 1900 20.421053
## 5 1910 25.755682
## 6 1920 61.837500
## 7 1930 84.012500
## 8 1940 80.987500
## 9 1950 130.375000
## 10 1960 132.166667
## 11 1970 120.093496
## 12 1980 126.700000
## 13 1990 148.737410
## 14 2000 173.826667
## 15 2010 152.973333
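(A quick check of the left-closed binning mentioned above: 1879 still lands in the first bin, while 1880 starts the second.)
findInterval(c(1879, 1880), decades)
## [1] 1 2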
I usually prefer the formula interface to aggregate as used above, but you can get better names directly by using the non-formula interface. Here's the example for each of the above:
with(teamdata, aggregate(list(mean.HR=HR), list(Decade=findInterval(yearID,decades)), FUN=mean))
## Decade mean.HR
## 1 1 4.641026
## ...
with(teamdata, aggregate(list(mean.HR=HR), list(Decade=10 * (yearID %/% 10)), FUN=mean))
## Decade mean.HR
## 1 1870 4.641026
## ...
Upvotes: 2
Reputation: 180
Here's a data.table example. Because others showed how to use cut, I took another route for splitting the data into decades:
library(data.table)
teamdata <- as.data.table(teamdata)  # the [ , , by= ] syntax below needs a data.table, not a data.frame
teamdata[,list(HRperYear=mean(HR)),by=10*floor((yearID)/10)]
However, the original question mentions average HRs per game, not per year (though the code and answers clearly deal with HRs per year).
Here's how you could compute average HRs per game (and average games per team per year):
teamdata[,list(HRperYear=mean(HR),HRperGame=sum(HR)/sum(G),games=mean(G)),by=10*floor(yearID/10)]
floor HRperYear HRperGame games
1: 1870 4.641026 0.08911866 52.07692
2: 1880 23.735849 0.21543555 110.17610
3: 1890 34.456522 0.25140108 137.05797
4: 1900 20.421053 0.13686067 149.21053
5: 1910 25.755682 0.17010657 151.40909
6: 1920 61.837500 0.40144445 154.03750
7: 1930 84.012500 0.54593453 153.88750
8: 1940 80.987500 0.52351325 154.70000
9: 1950 130.375000 0.84289640 154.67500
10: 1960 132.166667 0.81977946 161.22222
11: 1970 120.093496 0.74580935 161.02439
12: 1980 126.700000 0.80990313 156.43846
13: 1990 148.737410 0.95741873 155.35252
14: 2000 173.826667 1.07340167 161.94000
15: 2010 152.973333 0.94427984 162.00000
(The low average game totals in the 1980s and 1990s are due to the 1981 and 1994-95 player strikes.)
Upvotes: 6
Reputation: 479
You can also use the sqldf package to run SQL queries on the data. Here is the code:
library(sqldf)
sqldf("select floor(yearID/10)*10 as decade, avg(HR) as count
       from teamdata
       group by decade;")
decade count
1 1870 4.641026
2 1880 23.735849
3 1890 34.456522
4 1900 20.421053
5 1910 25.755682
6 1920 61.837500
7 1930 84.012500
8 1940 80.987500
9 1950 130.375000
10 1960 132.166667
11 1970 120.093496
12 1980 126.700000
13 1990 148.737410
14 2000 173.826667
15 2010 152.973333
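A per-game version in the same style is easy too; a sketch, reusing the games column G (the *1.0 forces floating-point division in SQLite):
sqldf("select floor(yearID/10)*10 as decade, sum(HR)*1.0/sum(G) as hr_per_game
       from teamdata
       group by decade;")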
Upvotes: 3
Reputation: 56935
PS: Nicely-written question, but it would be extra nice for you to provide a fully reproducible example so that I don't have to go and download the CSV to answer your question. Making dummy data is OK.
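(For instance, a made-up stand-in along these lines would have been enough to run the code in this answer; the numbers are invented, so they won't reproduce the real results printed below.)
set.seed(1)
teamdata <- data.frame(yearID = sample(1871:2015, 500, replace = TRUE),
                       HR     = rpois(500, lambda = 100))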
You can use seq to generate sequences.
decades <- seq(1870, 2020, by=10)
You can use cut to split up numeric variables into intervals.
teamdata$decade <- cut(teamdata$yearID, breaks=decades, dig.lab=4)
Basically it creates a factor with one level for each decade (as specified by the breaks). The dig.lab=4 is just so it prints the years as e.g. "1870" rather than "1.87e+03". See ?cut for further configuration (e.g. whether '1980' is included in this decade or the next one, and so on; you can even configure the labels if you think you'll use them).
Then to do something for each decade, use the plyr package (data.table and dplyr are other options, but I think plyr has the easiest learning curve, and your data does not seem large enough to need data.table).
library(plyr)
ddply(teamdata, .(decade), summarize, meanhomers=mean(HR))
decade meanhomers
1 (1870,1880] 4.930233
2 (1880,1890] 25.409091
3 (1890,1900] 35.115702
4 (1900,1910] 20.068750
5 (1910,1920] 27.284091
6 (1920,1930] 67.681250
7 (1930,1940] 84.050000
8 (1940,1950] 84.125000
9 (1950,1960] 130.718750
10 (1960,1970] 133.349515
11 (1970,1980] 117.745968
12 (1980,1990] 127.584615
13 (1990,2000] 155.053191
14 (2000,2010] 170.226667
15 (2010,2020] 152.775000
Mine is a little different from yours because my intervals are (, ] whereas yours are [, ). You can adjust cut to switch these around.
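For instance, right=FALSE gives left-closed [ , ) bins like the question's loop, and labels= can name each bin by its starting year (a sketch, reusing the decades vector from above):
teamdata$decade <- cut(teamdata$yearID, breaks=decades, right=FALSE,
                       labels=head(decades, -1))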
Upvotes: 4
Reputation: 3947
dplyr::group_by, mixed with cut, is a good option here, and avoids looping. The decades vector is just a stepped sequence.
decades <- seq(1870,2020,by=10)
cut breaks the data into categories, which I've labelled by the decades themselves for clarity:
teamdata$decade <- cut(teamdata$yearID, breaks=decades, right=FALSE, labels=decades[1:(length(decades)-1)])
Then dplyr handles the grouped summarise as neatly as you could hope:
library(dplyr)
teamdata %>% group_by(decade) %>% summarise(meanhomers=mean(HR))
# decade meanhomers
# (fctr) (dbl)
# 1 1870 4.641026
# 2 1880 23.735849
# 3 1890 34.456522
# 4 1900 20.421053
# 5 1910 25.755682
# 6 1920 61.837500
# 7 1930 84.012500
# 8 1940 80.987500
# 9 1950 130.375000
# 10 1960 132.166667
# 11 1970 120.093496
# 12 1980 126.700000
# 13 1990 148.737410
# 14 2000 173.826667
# 15 2010 152.973333
Upvotes: 1