Reputation: 3233
I'm working on the baseball data set:
data(baseball, package="plyr")
library(dplyr)
baseball[,1:4] %>% head
id year stint team
4 ansonca01 1871 1 RC1
44 forceda01 1871 1 WS3
68 mathebo01 1871 1 FW1
99 startjo01 1871 1 NY2
102 suttoez01 1871 1 CL1
106 whitede01 1871 1 CL1
First I want to group the data set by team in order to find the first year each team appears, and the number of distinct players that has ever played for each team:
baseball[,1:4] %>% group_by(team) %>%
summarise("first_year"=min(year), "num_distinct_players"=n_distinct(id))
# A tibble: 132 × 3
team first_year num_distinct_players
<chr> <int> <int>
1 ALT 1884 1
2 ANA 1997 29
3 ARI 1998 43
4 ATL 1966 133
5 BAL 1954 158
Now I want to add a column showing the maximum number of years any player (id) has played for the team in question. To do this, I need to somehow group by player within the existing group (team), and select the maximum number of rows. How do I do this?
Upvotes: 1
Views: 2566
Reputation: 32548
I tried doing this with base R
and came up with this. It's fairly slow.
df = data.frame(t(sapply(split(baseball, baseball$team), function(x)
cbind( min(x$year),
length(unique(x$id)),
max(sapply(split(x,x$id), function(y)
nrow(y))),
names(which.max(sapply(split(x,x$id), function(y)
nrow(y)))) ))))
colnames(df) = c("Year", "Unique Players", "Longest played duration",
"Longest Playing Player")
id
s which is the number of players in that teamid
and obtain the maximum number of rows that will give the maximum duration played by a player in that teamnames
of the id
with maximum rows which gives the name of the player that played for the longest time in that teamUpvotes: 1
Reputation: 887118
Perhaps this helps
baseball %>%
select(1:4) %>%
group_by(id, team) %>%
dplyr::mutate(nyear = n_distinct(year)) %>%
group_by(team) %>%
dplyr::summarise(first_year = min(year),
num_distinct_players = n_distinct(id),
maxYear = max(nyear))
Upvotes: 4