goldisfine
goldisfine

Reputation: 4850

Get count of group-level observations with multiple individual observations from dataframe in R

How do I get a dataframe like this:

soccer_player country position
"sam"         USA     left defender
"jon"         USA     right defender
"sam"         USA     left midfielder
"jon"         USA     offender
"bob"         England goalie
"julie"       England central midfielder
"jane"        England goalie

To look like this (country with the counts of unique players per country):

country player_count
USA     2
England 3

The obvious complication is that there are multiple observations per player, so I cannot simply do table(df$country) to get the number of observations per country.

I have been playing with the table() and merge() functions but have not had any luck.

Upvotes: 11

Views: 16464

Answers (5)

talat
talat

Reputation: 70336

One more base R option, using aggregate:

> aggregate(soccer_player ~ country, dd, FUN = function(x) length(unique(x)))
#  country soccer_player
#1 England             3
#2     USA             2

Upvotes: 1

G. Grothendieck
G. Grothendieck

Reputation: 270438

Here is an sqldf solution:

library(sqldf)

sqldf("select country, count(distinct soccer_player) player_count 
       from df 
       group by country")

##   country player_count
## 1 England            2
## 2     USA            2

and here is a base R solution:

as.data.frame(xtabs(~ country, unique(df[1:2])), responseName = "player_count")

##   country player_count
## 1 England            2
## 2     USA            2

Upvotes: 3

Ben Bolker
Ben Bolker

Reputation: 227001

The new features of dplyr v 3.0 provide a compact solution:

Data:

dd <- read.csv(text='
soccer_player,country,position
"sam",USA,left defender
"jon",USA,right defender
"sam",USA,left midfielder
"jon",USA,offender
"bob",England,goalie
"julie",England,central midfielder
"jane",England,goalie')

Code:

library(dplyr)

dd %>% distinct(soccer_player,country) %>% 
       count(country)

Upvotes: 7

Matthew Plourde
Matthew Plourde

Reputation: 44634

Here's one way:

as.data.frame(table(unique(d[-3])$country))
#      Var1 Freq
# 1 England    3
# 2     USA    2

Drop the third column, remove any duplicate Country-Name pairs, then count the occurrences of each country.

Upvotes: 7

Se&#241;or O
Se&#241;or O

Reputation: 17432

Without using any packages you can do:

List = by(df, df$country, function(x) length(unique(x$soccer_player)))
DataFrame = do.call(rbind, lapply(names(List), function(x) 
  data.frame(country=x, player_count=List[[x]])))
#  country player_count
#1 England            2
#2     USA            2

It's easier with something like data.table:

dt = data.table(df)
dt[,list(player_count = length(unique(soccer_player))),by=country]

Upvotes: 6

Related Questions