Reputation: 201
I'm sure this question has been posed before, but I'd like some input on my specific case. In return for your help, I'll use an interesting example.
Sean Lahman provides giant datasets of MLB baseball statistics, available free on his website (http://www.seanlahman.com/baseball-archive/statistics/).
I'd like to use this data to answer the following question: What is the average number of home runs per game recorded for each decade in the MLB?
Below I've pasted all relevant script:
teamdata = read.csv("Teams.csv", header = TRUE)
decades = c(1870, 1880, 1890, 1900, 1910, 1920, 1930, 1940,
            1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020)
meanhomers = c()
for (i in 1:length(decades)) {
  # mean HR per team-season for years in [decades[i], decades[i+1]);
  # the final iteration compares against NA, hence the trailing NaN below
  meanhomers[i] = mean(teamdata$HR[teamdata$yearID >= decades[i] &
                                   teamdata$yearID < decades[i + 1]])
}
My primary question is, how could this answer have been determined without resorting to the dreaded for-loop?
Side question: What simple script would have generated the decades vector for me?
(For those interested in the answer to the baseball question, see below.)
meanhomers
[1] 4.641026 23.735849 34.456522 20.421053 25.755682 61.837500 84.012500
[8] 80.987500 130.375000 132.166667 120.093496 126.700000 148.737410 173.826667
[15] 152.973333 NaN
Edit for clarity: It turns out I answered the wrong question; the answer provided above gives the number of home runs per team per year, not per game. A little fix of the denominator would get the correct result.
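A minimal sketch of that fix, using the games column G from Teams.csv: total the home runs and the games within each decade, then divide.
decade <- 10 * (teamdata$yearID %/% 10)  # decade each team-season belongs to
hr_per_game <- tapply(teamdata$HR, decade, sum) / tapply(teamdata$G, decade, sum)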
Upvotes: 4
Views: 233
Reputation: 42689
aggregate is handy for this sort of thing. You can use your decades object with findInterval to put the years into bins:
aggregate(HR ~ findInterval(yearID, decades), data=teamdata, FUN=mean)
## findInterval(yearID, decades) HR
## 1 1 4.641026
## 2 2 23.735849
## 3 3 34.456522
## 4 4 20.421053
## 5 5 25.755682
## 6 6 61.837500
## 7 7 84.012500
## 8 8 80.987500
## 9 9 130.375000
## 10 10 132.166667
## 11 11 120.093496
## 12 12 126.700000
## 13 13 148.737410
## 14 14 173.826667
## 15 15 152.973333
Note that the intervals used are left-closed, as you desire. Also note that the intervals need not be regular. Yours are, which leads to the "side question" of how to produce the decades vector: don't even compute it. Instead, directly compute which decade each year falls in:
aggregate(HR ~ I(10 * (yearID %/% 10)), data=teamdata, FUN=mean)
## I(10 * (yearID%/%10)) HR
## 1 1870 4.641026
## 2 1880 23.735849
## 3 1890 34.456522
## 4 1900 20.421053
## 5 1910 25.755682
## 6 1920 61.837500
## 7 1930 84.012500
## 8 1940 80.987500
## 9 1950 130.375000
## 10 1960 132.166667
## 11 1970 120.093496
## 12 1980 126.700000
## 13 1990 148.737410
## 14 2000 173.826667
## 15 2010 152.973333
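(A quick check of the left-closed binning mentioned above: 1879 still lands in the first bin, while 1880 starts the second.)
findInterval(c(1879, 1880), decades)
## [1] 1 2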
I usually prefer the formula interface to aggregate as used above, but you can get better names directly by using the non-formula interface. Here's the example for each of the above:
with(teamdata, aggregate(list(mean.HR=HR), list(Decade=findInterval(yearID,decades)), FUN=mean))
## Decade mean.HR
## 1 1 4.641026
## ...
with(teamdata, aggregate(list(mean.HR=HR), list(Decade=10 * (yearID %/% 10)), FUN=mean))
## Decade mean.HR
## 1 1870 4.641026
## ...
Upvotes: 2
Reputation: 180
Here's a data.table example. Because others showed how to use cut, I took another route for splitting the data into decades:
library(data.table)
teamdata <- as.data.table(teamdata)  # the [ , , by= ] syntax below needs a data.table, not a data.frame
teamdata[,list(HRperYear=mean(HR)),by=10*floor((yearID)/10)]
However, the original question mentions average HRs per game, not per year (though the code and answers clearly deal with HRs per year).
Here's how you could compute average HRs per game (and average games per team per year):
teamdata[,list(HRperYear=mean(HR),HRperGame=sum(HR)/sum(G),games=mean(G)),by=10*floor(yearID/10)]
floor HRperYear HRperGame games
1: 1870 4.641026 0.08911866 52.07692
2: 1880 23.735849 0.21543555 110.17610
3: 1890 34.456522 0.25140108 137.05797
4: 1900 20.421053 0.13686067 149.21053
5: 1910 25.755682 0.17010657 151.40909
6: 1920 61.837500 0.40144445 154.03750
7: 1930 84.012500 0.54593453 153.88750
8: 1940 80.987500 0.52351325 154.70000
9: 1950 130.375000 0.84289640 154.67500
10: 1960 132.166667 0.81977946 161.22222
11: 1970 120.093496 0.74580935 161.02439
12: 1980 126.700000 0.80990313 156.43846
13: 1990 148.737410 0.95741873 155.35252
14: 2000 173.826667 1.07340167 161.94000
15: 2010 152.973333 0.94427984 162.00000
(The low average game totals in the 1980s and 1990s are due to the 1981 and 1994-95 player strikes.)
Upvotes: 6
Reputation: 479
You can also use the sqldf package to run SQL queries on the data. Here is the code:
library(sqldf)
sqldf("select floor(yearID/10)*10 as decade, avg(HR) as count
       from teamdata
       group by decade;")
decade count
1 1870 4.641026
2 1880 23.735849
3 1890 34.456522
4 1900 20.421053
5 1910 25.755682
6 1920 61.837500
7 1930 84.012500
8 1940 80.987500
9 1950 130.375000
10 1960 132.166667
11 1970 120.093496
12 1980 126.700000
13 1990 148.737410
14 2000 173.826667
15 2010 152.973333
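A per-game version in the same style is easy too; a sketch, reusing the games column G (the *1.0 forces floating-point division in SQLite):
sqldf("select floor(yearID/10)*10 as decade, sum(HR)*1.0/sum(G) as hr_per_game
       from teamdata
       group by decade;")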
Upvotes: 3
Reputation: 56935
PS: Nicely-written question, but it would be extra nice for you to provide a fully reproducible example so that I don't have to go and download the CSV to answer your question. Making dummy data is OK.
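(For instance, a made-up stand-in along these lines would have been enough to run the code in this answer; the numbers are invented, so they won't reproduce the real results printed below.)
set.seed(1)
teamdata <- data.frame(yearID = sample(1871:2015, 500, replace = TRUE),
                       HR     = rpois(500, lambda = 100))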
You can use seq to generate sequences.
decades <- seq(1870, 2020, by=10)
You can use cut to split up numeric variables into intervals.
teamdata$decade <- cut(teamdata$yearID, breaks=decades, dig.lab=4)
Basically it creates a factor with one level for each decade (as specified by the breaks). The dig.lab=4 is just so it prints the years as e.g. "1870" rather than "1.87e+03". See ?cut for further configuration (e.g. whether '1980' is included in this decade or the next one, and so on; you can even configure the labels if you think you'll use them).
Then to do something for each decade, use the plyr package (data.table and dplyr are other options, but I think plyr has the easiest learning curve, and your data does not seem large enough to need data.table).
library(plyr)
ddply(teamdata, .(decade), summarize, meanhomers=mean(HR))
decade meanhomers
1 (1870,1880] 4.930233
2 (1880,1890] 25.409091
3 (1890,1900] 35.115702
4 (1900,1910] 20.068750
5 (1910,1920] 27.284091
6 (1920,1930] 67.681250
7 (1930,1940] 84.050000
8 (1940,1950] 84.125000
9 (1950,1960] 130.718750
10 (1960,1970] 133.349515
11 (1970,1980] 117.745968
12 (1980,1990] 127.584615
13 (1990,2000] 155.053191
14 (2000,2010] 170.226667
15 (2010,2020] 152.775000
Mine is a little different from yours because my intervals are (, ] whereas yours are [, ). You can adjust cut to switch these around.
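For instance, right=FALSE gives left-closed [ , ) bins like the question's loop, and labels= can name each bin by its starting year (a sketch, reusing the decades vector from above):
teamdata$decade <- cut(teamdata$yearID, breaks=decades, right=FALSE,
                       labels=head(decades, -1))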
Upvotes: 4
Reputation: 3947
dplyr::group_by, mixed with cut, is a good option here, and avoids looping. The decades vector is just a stepped sequence.
decades <- seq(1870,2020,by=10)
cut breaks the data into categories, which I've labelled by the decades themselves for clarity:
teamdata$decade <- cut(teamdata$yearID, breaks=decades, right=FALSE, labels=decades[1:(length(decades)-1)])
Then dplyr handles the grouped summarise as neatly as you could hope:
library(dplyr)
teamdata %>% group_by(decade) %>% summarise(meanhomers=mean(HR))
# decade meanhomers
# (fctr) (dbl)
# 1 1870 4.641026
# 2 1880 23.735849
# 3 1890 34.456522
# 4 1900 20.421053
# 5 1910 25.755682
# 6 1920 61.837500
# 7 1930 84.012500
# 8 1940 80.987500
# 9 1950 130.375000
# 10 1960 132.166667
# 11 1970 120.093496
# 12 1980 126.700000
# 13 1990 148.737410
# 14 2000 173.826667
# 15 2010 152.973333
Upvotes: 1