Klaus Louis
Klaus Louis

Reputation: 151

R: Count unique values by category

I have data in R that looks like this:

 Cnty   Yr   Plt       Spp  DBH Ht Age
 1  185 1999 20001 Bitternut  8.0 54  47
 2  185 1999 20001 Bitternut  7.2 55  50
 3   31 1999 20001    Pignut  7.4 71  60
 4   31 1999 20001    Pignut 11.4 85 114
 5  189 1999 20001        WO 14.5 80  82
 6  189 1999 20001        WO 12.1 72  79

I would like to know the quantity of unique species (Spp) in each county (Cnty). "unique(dfname$Spp)" gives me a total count of unique species in the data frame, but I would like it by county.

Any help is appreciated! Sorry for the weird formatting, this is my first ever question on SO.

Thanks.

Upvotes: 15

Views: 49463

Answers (7)

Jot eN
Jot eN

Reputation: 6416

set.seed(1)
mydf <- data.frame(
  Cnty = rep(c("185", "31", "189"), times = c(5, 3, 2)),
  Yr = c(rep(c("1999", "2000"), times = c(3, 2)), 
         "1999", "1999", "2000", "2000", "2000"),
  Plt = "20001",
  Spp = sample(c("Bitternut", "Pignut", "WO"), 10, replace = TRUE),
  DBH = runif(10, 0, 15)
)
mydf

The dplyr package could help here:


library(dplyr)
mydf %>% 
  group_by(Cnty) %>% 
  summarise(un_Spp = length(unique(Spp)))
#> # A tibble: 3 × 2
#>   Cnty  un_Spp
#>   <chr>  <int>
#> 1 185        3
#> 2 189        2
#> 3 31         2

Upvotes: 1

Vaibhav Bhat
Vaibhav Bhat

Reputation: 1

We can now use the tally function to make this easier.

tally(group_by(mydf, Spp, Cnty))

        Spp   Cnty     n
     <fctr> <fctr> <int>
1 Bitternut    185     2
2 Bitternut    189     1
3    Pignut    185     2
4    Pignut    189     1
5    Pignut     31     1
6        WO    185     1
7        WO     31     2

Upvotes: -1

rafa.pereira
rafa.pereira

Reputation: 13827

A simple solution using the data.table approach.

library(data.table)

output <- setDT(mydf)[ , .(count=.N) , by = .(Spp,Cnty)]

in case you want to reshape the output into a nicer table format:

library(tidyr)

spread(data=a, key =Spp, count)

#   Cnty Bitternut Pignut WO
# 1:  185         2      2  1
# 2:  189         1      1 NA
# 3:   31        NA      1  2

# or perhaps like this:

spread(data=a, key =Cnty, count)

#          Spp 185 189 31
# 1: Bitternut   2   1 NA
# 2:    Pignut   2   1  1
# 3:        WO   1  NA  2

Upvotes: 0

Matt Ober
Matt Ober

Reputation: 118

I wanted to add on to what A Handcart And Mohair mentioned. For those of you wanting to get the results of the code below into a data frame(helpful in R studio)...

with(mydf, table(Spp, Cnty))
#            Cnty
# Spp         185 189 31
#   Bitternut   2   1  0
#   Pignut      2   1  1
#   WO          1   0  2
ftable(mydf, row.vars="Spp", col.vars=c("Cnty", "Yr"))
#           Cnty  185       189        31     
#           Yr   1999 2000 1999 2000 1999 2000
# Spp                                         
# Bitternut         1    1    0    1    0    0
# Pignut            2    0    0    1    0    1
# WO                0    1    0    0    2    0

You'll need to put the as.data.frame.matrix modifier in front of your code like so:

as.data.frame.matrix(with(mydf, table(Spp, Cnty)))

I was pretty new to R when I came upon this post, and it took me a long time to figure that out, so I thought I'd share.

Upvotes: 0

user3835068
user3835068

Reputation: 1

with(mydf, tapply(Spp, list(Cnty, Yr), 
     FUN = function(x) length(unique(x))))

unique query is not working with large data set i mean data more than 1000k row.

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

I've tried to make your sample data a little bit more interesting. Your sample data presently has just one unique "Spp" per "Cnty".

set.seed(1)
mydf <- data.frame(
  Cnty = rep(c("185", "31", "189"), times = c(5, 3, 2)),
  Yr = c(rep(c("1999", "2000"), times = c(3, 2)), 
         "1999", "1999", "2000", "2000", "2000"),
  Plt = "20001",
  Spp = sample(c("Bitternut", "Pignut", "WO"), 10, replace = TRUE),
  DBH = runif(10, 0, 15)
)
mydf
#    Cnty   Yr   Plt       Spp       DBH
# 1   185 1999 20001 Bitternut  3.089619
# 2   185 1999 20001    Pignut  2.648351
# 3   185 1999 20001    Pignut 10.305343
# 4   185 2000 20001        WO  5.761556
# 5   185 2000 20001 Bitternut 11.547621
# 6    31 1999 20001        WO  7.465489
# 7    31 1999 20001        WO 10.764278
# 8    31 2000 20001    Pignut 14.878591
# 9   189 2000 20001    Pignut  5.700528
# 10  189 2000 20001 Bitternut 11.661678

Next, as suggested, tapply is a good candidate here. Combine unique and length to get the data you are looking for.

with(mydf, tapply(Spp, Cnty, FUN = function(x) length(unique(x))))
# 185 189  31 
#   3   2   2 
with(mydf, tapply(Spp, list(Cnty, Yr), FUN = function(x) length(unique(x))))
#     1999 2000
# 185    2    2
# 189   NA    2
# 31     1    1

If you're interested in simple tabulation (not of unique values), then you can explore table and ftable:

with(mydf, table(Spp, Cnty))
#            Cnty
# Spp         185 189 31
#   Bitternut   2   1  0
#   Pignut      2   1  1
#   WO          1   0  2
ftable(mydf, row.vars="Spp", col.vars=c("Cnty", "Yr"))
#           Cnty  185       189        31     
#           Yr   1999 2000 1999 2000 1999 2000
# Spp                                         
# Bitternut         1    1    0    1    0    0
# Pignut            2    0    0    1    0    1
# WO                0    1    0    0    2    0

Upvotes: 19

Arhopala
Arhopala

Reputation: 366

As Justin mentioned aggregate is probably what you want. If you call your data frame foo, then the following should give you what you want, namely the number of individuals per species assuming that each row with Butternut represents a unique individual belonging to the butternut species. Note I used foo$Age to calculate the length of the vector, i.e. the number of individuals (row) belonging to each species, but you could use foo$Ht or foo$DBH etc.

aggregate(foo$Age, by = foo[c('Spp','Cnty')], length)

Cheers,

Danny

Upvotes: 2

Related Questions