Lmm

Reputation: 413

dplyr group by and summarize in tally

Example data below...

I want to tally the number of points counted per month for each "Type" (Type is the kind of shipping vessel). So initially I want to summarize how many points of each vessel type are counted in each month in total, e.g. June has 5 points for Fishing vessels.

Preferably using dplyr.

I have something like:

dfsum <- df %>% group_by(Month, Type) %>% tally()

This works well enough. However, I would also like to do the above by unique vessel IDs: a ship can have multiple points per month, but I would like to know how many unique vessels are present each month.

I could just add id to the grouping:

dfsum2 <- df %>% group_by(Month, id, Type) %>% tally()

However, this is less tidy and, with a larger data set, would be harder to compile. Rather, I want a result showing that in Feb there are 2 unique fishing vessels (using this example data). Is there a better way to extract this information?

Desired output:

Month   Type      n
Jan     Fishing  x
Feb     Fishing  x
Feb     Sailing  x
March   Fishing  x

Where x is the number of unique vessels (by id) of that type in that month.

#Dummy data

df<- structure(list(UTC_Time = structure(c(1L, 1L, 1L, 1L, 339L, 339L, 
339L, 68L, 68L, 68L, 154L, 154L, 154L, 154L, 154L, 154L, 14L, 
14L, 14L, 14L, 14L, 15L, 50L, 50L, 51L, 51L, 51L, 51L, 51L, 51L, 
51L, 51L, 51L, 77L, 146L, 147L, 147L, 147L, 147L, 147L, 148L, 
148L), .Label = c("2018-01-01 0:00:00", "2018-01-02 0:00:00", 
"2018-01-03 0:00:00", "2018-01-04 0:00:00", "2018-01-05 0:00:00", 
"2018-01-06 0:00:00", "2018-01-07 0:00:00", "2018-01-08 0:00:00", 
"2018-01-09 0:00:00", "2018-01-10 0:00:00", "2018-01-11 0:00:00", 
"2018-01-12 0:00:00", "2018-01-13 0:00:00", "2018-01-14 0:00:00", 
"2018-01-15 0:00:00", "2018-01-16 0:00:00", "2018-01-17 0:00:00", 
"2018-01-18 0:00:00", "2018-01-19 0:00:00", "2018-01-20 0:00:00", 
"2018-01-21 0:00:00", "2018-01-22 0:00:00", "2018-01-23 0:00:00", 
"2018-01-24 0:00:00", "2018-01-25 0:00:00", "2018-01-26 0:00:00", 
"2018-01-27 0:00:00", "2018-01-28 0:00:00", "2018-01-29 0:00:00", 
"2018-01-30 0:00:00", "2018-01-31 0:00:00", "2018-02-01 0:00:00", 
"2018-02-02 0:00:00", "2018-02-03 0:00:00", "2018-02-04 0:00:00", 
"2018-02-05 0:00:00", "2018-02-06 0:00:00", "2018-02-07 0:00:00", 
"2018-02-08 0:00:00", "2018-02-09 0:00:00", "2018-02-10 0:00:00", 
"2018-02-11 0:00:00", "2018-02-12 0:00:00", "2018-02-13 0:00:00", 
"2018-02-14 0:00:00", "2018-02-15 0:00:00", "2018-02-16 0:00:00", 
"2018-02-17 0:00:00", "2018-02-18 0:00:00", "2018-02-19 0:00:00", 
"2018-02-20 0:00:00", "2018-02-21 0:00:00", "2018-02-22 0:00:00", 
"2018-02-23 0:00:00", "2018-02-24 0:00:00", "2018-02-25 0:00:00", 
"2018-02-26 0:00:00", "2018-02-27 0:00:00", "2018-02-28 0:00:00", 
 "2018-03-01 0:00:00", "2018-03-02 0:00:00", "2018-03-03 0:00:00", 
"2018-03-04 0:00:00", "2018-03-05 0:00:00", "2018-03-06 0:00:00", 
"2018-03-07 0:00:00", "2018-03-08 0:00:00", "2018-03-09 0:00:00", 
"2018-03-10 0:00:00", "2018-03-11 0:00:00", "2018-03-12 0:00:00", 
"2018-03-13 0:00:00", "2018-03-14 0:00:00", "2018-03-15 0:00:00", 
"2018-03-16 0:00:00", "2018-03-17 0:00:00", "2018-03-18 0:00:00", 
"2018-03-19 0:00:00", "2018-03-20 0:00:00", "2018-03-21 0:00:00", 
"2018-03-22 0:00:00", "2018-03-23 0:00:00", "2018-03-24 0:00:00", 
"2018-03-25 0:00:00", "2018-03-26 0:00:00", "2018-03-27 0:00:00", 
"2018-03-28 0:00:00", "2018-03-29 0:00:00", "2018-03-30 0:00:00", 
"2018-03-31 0:00:00", "2018-04-01 0:00:00", "2018-04-02 0:00:00", 
"2018-04-03 0:00:00", "2018-04-04 0:00:00", "2018-04-05 0:00:00", 
"2018-04-06 0:00:00", "2018-04-07 0:00:00", "2018-04-08 0:00:00", 
 "2018-04-09 0:00:00", "2018-04-10 0:00:00", "2018-04-11 0:00:00", 
"2018-04-12 0:00:00", "2018-04-13 0:00:00", "2018-04-14 0:00:00", 
"2018-04-15 0:00:00", "2018-04-16 0:00:00", "2018-04-17 0:00:00", 
"2018-04-18 0:00:00", "2018-04-19 0:00:00", "2018-04-20 0:00:00", 
"2018-04-21 0:00:00", "2018-04-22 0:00:00", "2018-04-23 0:00:00", 
"2018-04-24 0:00:00", "2018-04-25 0:00:00", "2018-04-26 0:00:00", 
 "2018-04-27 0:00:00", "2018-04-28 0:00:00", "2018-04-29 0:00:00", 
"2018-04-30 0:00:00", "2018-05-01 0:00:00", "2018-05-02 0:00:00", 
"2018-05-03 0:00:00", "2018-05-04 0:00:00", "2018-05-05 0:00:00", 
"2018-05-06 0:00:00", "2018-05-07 0:00:00", "2018-05-08 0:00:00", 
"2018-05-09 0:00:00", "2018-05-10 0:00:00", "2018-05-11 0:00:00", 
"2018-05-12 0:00:00", "2018-05-13 0:00:00", "2018-05-14 0:00:00", 
"2018-05-15 0:00:00", "2018-05-16 0:00:00", "2018-05-17 0:00:00", 
"2018-05-18 0:00:00", "2018-05-19 0:00:00", "2018-05-20 0:00:00", 
"2018-05-21 0:00:00", "2018-05-22 0:00:00", "2018-05-23 0:00:00", 
"2018-05-24 0:00:00", "2018-05-25 0:00:00", "2018-05-26 0:00:00", 
"2018-05-27 0:00:00", "2018-05-28 0:00:00", "2018-05-29 0:00:00", 
"2018-05-30 0:00:00", "2018-05-31 0:00:00", "2018-06-01 0:00:00", 
"2018-06-02 0:00:00", "2018-06-03 0:00:00", "2018-06-04 0:00:00", 
"2018-06-05 0:00:00", "2018-06-06 0:00:00", "2018-06-07 0:00:00", 
"2018-06-08 0:00:00", "2018-06-09 0:00:00", "2018-06-10 0:00:00", 
"2018-06-11 0:00:00", "2018-06-12 0:00:00", "2018-06-13 0:00:00", 
"2018-06-14 0:00:00", "2018-06-15 0:00:00", "2018-06-16 0:00:00", 
"2018-06-17 0:00:00", "2018-06-18 0:00:00", "2018-06-19 0:00:00", 
"2018-06-20 0:00:00", "2018-06-21 0:00:00", "2018-06-22 0:00:00", 
"2018-06-23 0:00:00", "2018-06-24 0:00:00", "2018-06-25 0:00:00", 
"2018-06-26 0:00:00", "2018-06-27 0:00:00", "2018-06-28 0:00:00", 
"2018-06-29 0:00:00", "2018-06-30 0:00:00", "2018-07-01 0:00:00", 
"2018-07-02 0:00:00", "2018-07-03 0:00:00", "2018-07-04 0:00:00", 
"2018-07-05 0:00:00", "2018-07-06 0:00:00", "2018-07-07 0:00:00", 
"2018-07-08 0:00:00", "2018-07-09 0:00:00", "2018-07-10 0:00:00", 
"2018-07-11 0:00:00", "2018-07-12 0:00:00", "2018-07-13 0:00:00", 
"2018-07-14 0:00:00", "2018-07-15 0:00:00", "2018-07-16 0:00:00", 
 "2018-07-17 0:00:00", "2018-07-18 0:00:00", "2018-07-19 0:00:00", 
"2018-07-20 0:00:00", "2018-07-21 0:00:00", "2018-07-22 0:00:00", 
 "2018-07-23 0:00:00", "2018-07-24 0:00:00", "2018-07-25 0:00:00", 
"2018-07-26 0:00:00", "2018-07-27 0:00:00", "2018-07-28 0:00:00", 
"2018-07-29 0:00:00", "2018-07-30 0:00:00", "2018-07-31 0:00:00", 
"2018-08-01 0:00:00", "2018-08-02 0:00:00", "2018-08-03 0:00:00", 
 "2018-08-04 0:00:00", "2018-08-05 0:00:00", "2018-08-06 0:00:00", 
 "2018-08-07 0:00:00", "2018-08-08 0:00:00", "2018-08-09 0:00:00", 
"2018-08-10 0:00:00", "2018-08-11 0:00:00", "2018-08-12 0:00:00", 
 "2018-08-13 0:00:00", "2018-08-14 0:00:00", "2018-08-15 0:00:00", 
"2018-08-16 0:00:00", "2018-08-17 0:00:00", "2018-08-18 0:00:00", 
"2018-08-19 0:00:00", "2018-08-20 0:00:00", "2018-08-21 0:00:00", 
"2018-08-22 0:00:00", "2018-08-23 0:00:00", "2018-08-24 0:00:00", 
"2018-08-25 0:00:00", "2018-08-26 0:00:00", "2018-08-27 0:00:00", 
"2018-08-28 0:00:00", "2018-08-29 0:00:00", "2018-08-30 0:00:00", 
"2018-08-31 0:00:00", "2018-09-01 0:00:00", "2018-09-02 0:00:00", 
"2018-09-03 0:00:00", "2018-09-04 0:00:00", "2018-09-05 0:00:00", 
"2018-09-06 0:00:00", "2018-09-07 0:00:00", "2018-09-08 0:00:00", 
"2018-09-09 0:00:00", "2018-09-10 0:00:00", "2018-09-11 0:00:00", 
"2018-09-12 0:00:00", "2018-09-13 0:00:00", "2018-09-14 0:00:00", 
"2018-09-15 0:00:00", "2018-09-16 0:00:00", "2018-09-17 0:00:00", 
"2018-09-18 0:00:00", "2018-09-19 0:00:00", "2018-09-20 0:00:00", 
"2018-09-21 0:00:00", "2018-09-22 0:00:00", "2018-09-23 0:00:00", 
 "2018-09-24 0:00:00", "2018-09-25 0:00:00", "2018-09-26 0:00:00", 
 "2018-09-27 0:00:00", "2018-09-28 0:00:00", "2018-09-29 0:00:00", 
 "2018-09-30 0:00:00", "2018-10-01 0:00:00", "2018-10-02 0:00:00", 
  "2018-10-03 0:00:00", "2018-10-04 0:00:00", "2018-10-05 0:00:00", 
  "2018-10-06 0:00:00", "2018-10-07 0:00:00", "2018-10-08 0:00:00", 
  "2018-10-09 0:00:00", "2018-10-10 0:00:00", "2018-10-11 0:00:00", 
  "2018-10-12 0:00:00", "2018-10-13 0:00:00", "2018-10-14 0:00:00", 
  "2018-10-15 0:00:00", "2018-10-16 0:00:00", "2018-10-17 0:00:00", 
 "2018-10-18 0:00:00", "2018-10-19 0:00:00", "2018-10-20 0:00:00", 
  "2018-10-21 0:00:00", "2018-10-22 0:00:00", "2018-10-23 0:00:00", 
 "2018-10-24 0:00:00", "2018-10-25 0:00:00", "2018-10-26 0:00:00", 
 "2018-10-27 0:00:00", "2018-10-28 0:00:00", "2018-10-29 0:00:00", 
"2018-10-30 0:00:00", "2018-10-31 0:00:00", "2018-11-01 0:00:00", 
"2018-11-02 0:00:00", "2018-11-03 0:00:00", "2018-11-04 0:00:00", 
"2018-11-05 0:00:00", "2018-11-06 0:00:00", "2018-11-07 0:00:00", 
"2018-11-08 0:00:00", "2018-11-09 0:00:00", "2018-11-10 0:00:00", 
"2018-11-11 0:00:00", "2018-11-12 0:00:00", "2018-11-13 0:00:00", 
"2018-11-14 0:00:00", "2018-11-15 0:00:00", "2018-11-16 0:00:00", 
"2018-11-17 0:00:00", "2018-11-18 0:00:00", "2018-11-19 0:00:00", 
"2018-11-20 0:00:00", "2018-11-21 0:00:00", "2018-11-22 0:00:00", 
"2018-11-23 0:00:00", "2018-11-24 0:00:00", "2018-11-25 0:00:00", 
"2018-11-26 0:00:00", "2018-11-27 0:00:00", "2018-11-28 0:00:00", 
"2018-11-29 0:00:00", "2018-11-30 0:00:00", "2018-12-01 0:00:00", 
"2018-12-02 0:00:00", "2018-12-03 0:00:00", "2018-12-04 0:00:00", 
"2018-12-05 0:00:00", "2018-12-06 0:00:00", "2018-12-07 0:00:00", 
"2018-12-08 0:00:00", "2018-12-09 0:00:00", "2018-12-10 0:00:00", 
"2018-12-11 0:00:00", "2018-12-12 0:00:00", "2018-12-13 0:00:00", 
"2018-12-14 0:00:00", "2018-12-15 0:00:00", "2018-12-16 0:00:00", 
"2018-12-17 0:00:00", "2018-12-18 0:00:00", "2018-12-19 0:00:00", 
"2018-12-20 0:00:00", "2018-12-21 0:00:00", "2018-12-22 0:00:00", 
"2018-12-23 0:00:00", "2018-12-24 0:00:00", "2018-12-25 0:00:00", 
 "2018-12-26 0:00:00", "2018-12-27 0:00:00", "2018-12-28 0:00:00", 
"2018-12-29 0:00:00", "2018-12-30 0:00:00", "2018-12-31 0:00:00", 
"2019-01-01 0:00:00"), class = "factor"), Type = structure(c(4L, 
4L, 4L, 4L, 4L, 4L, 4L, 17L, 17L, 17L, 4L, 12L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 17L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L), .Label = c("Cargo ship", 
 "Cargo ship:DG,HS,MP(OS)", "Cargo ship:DG,HS,MP(X)", "Fishing", 
   "Law enforcement", "Local ship", "Passenger ship", "Passenger ship:DG,HS,MP(OS)", 
 "Passenger ship:DG,HS,MP(Y)", "Pilot", "Pleasure Craft", "Sailing", 
 "Search/rescue", "Ship", "Towing", "Towing(200/25)", "Tug"), class = "factor"), 
Month = structure(c(5L, 5L, 5L, 5L, 3L, 3L, 3L, 8L, 8L, 8L, 
7L, 7L, 7L, 7L, 7L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L), .Label = c("Apr", "Aug", "Dec", "Feb", "Jan", "Jul", 
"Jun", "Mar", "May", "Nov", "Oct", "Sep"), class = "factor"), 
id = c(27L, 27L, 27L, 27L, 21L, 21L, 21L, 24L, 24L, 24L, 
20L, 6L, 20L, 20L, 20L, 20L, 48L, 48L, 48L, 48L, 48L, 42L, 
34L, 34L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 23L, 
17L, 17L, 17L, 14L, 14L, 3L, 14L, 3L)), row.names = c(1L, 
2L, 3L, 4L, 650L, 651L, 652L, 262L, 263L, 264L, 400L, 401L, 402L, 
403L, 404L, 405L, 100L, 101L, 102L, 103L, 104L, 105L, 250L, 251L, 
252L, 253L, 254L, 255L, 256L, 257L, 258L, 259L, 260L, 300L, 301L, 
302L, 303L, 304L, 305L, 306L, 307L, 308L), class = "data.frame")

Upvotes: 0

Views: 727

Answers (2)

Duck

Reputation: 39623

A base R approach could be the following (it can sometimes be fast):

# Number of distinct vessel types per month
result <- aggregate(Type ~ Month, df, function(x) length(unique(x)))

Output:

  Month Type
1   Dec    1
2   Feb    1
3   Jan    1
4   Jun    2
5   Mar    1
6   May    1

Or maybe:

# Number of distinct vessel ids per month
result2 <- aggregate(id ~ Month, df, function(x) length(unique(x)))

Output:

  Month id
1   Dec  1
2   Feb  2
3   Jan  3
4   Jun  2
5   Mar  2
6   May  3

Based on the expected output you can try this:

# Number of distinct vessel ids per month and type
new <- aggregate(id ~ Month + Type, data = df, function(x) length(unique(x)))

Output:

  Month           Type id
1   Dec        Fishing  1
2   Feb        Fishing  2
3   Jan        Fishing  3
4   Jun        Fishing  1
5   May Passenger ship  3
6   Jun        Sailing  1
7   Mar            Tug  2

Or using dplyr:

library(dplyr)
# Number of distinct vessel ids per month and type
new <- df %>% group_by(Month, Type) %>% summarise(N = length(unique(id)))

Output:

# A tibble: 7 x 3
# Groups:   Month [6]
  Month Type               N
  <fct> <fct>          <int>
1 Dec   Fishing            1
2 Feb   Fishing            2
3 Jan   Fishing            3
4 Jun   Fishing            1
5 Jun   Sailing            1
6 Mar   Tug                2
7 May   Passenger ship     3
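As a minimal sketch of how the point tally from the question and the unique-vessel count can sit side by side, the snippet below uses a small hypothetical data frame (`toy`, with made-up values, not the question's data) that has the same Month, Type and id columns. It assumes dplyr 1.0.0 or later for the `.groups` argument.

```r
library(dplyr)

# Hypothetical toy data with the same columns as the question's df
toy <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Type  = c("Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Sailing"),
  id    = c(27L, 27L, 48L, 34L, 31L, 31L, 6L)
)

both <- toy %>%
  group_by(Month, Type) %>%
  summarise(
    points  = n(),             # rows (points) per group, same number tally() gives
    vessels = n_distinct(id),  # unique vessel ids per group
    .groups = "drop"           # return an ungrouped tibble
  )

both
```

With `.groups = "drop"` the result is no longer grouped by Month, which is usually what you want if the table is only going to be printed or joined.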

Upvotes: 2

akrun

Reputation: 887981

We can use n_distinct to find the number of unique 'Type' values by 'Month':

library(dplyr)
df %>% 
      group_by(Month) %>% 
      summarise(n = n_distinct(Type))

-output

# A tibble: 6 x 2
#  Month     n
#  <fct> <int>
#1 Dec       1
#2 Feb       1
#3 Jan       1
#4 Jun       2
#5 Mar       1
#6 May       1

If it is based on 'id':

df %>%
    group_by(Month) %>%
    summarise(n = n_distinct(id))

-output

# A tibble: 6 x 2
#  Month     n
#  <fct> <int>
#1 Dec       1
#2 Feb       2
#3 Jan       3
#4 Jun       2
#5 Mar       2
#6 May       3
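The months in these outputs come out alphabetically (Dec, Feb, Jan, ...) because that is how the Month factor levels are ordered. A possible way to get calendar order, sketched here on a small hypothetical data frame (`toy`, made-up values), is to re-level Month with base R's `month.abb` constant before grouping. This assumes the month labels are the three-letter English abbreviations, as in the question's data.

```r
library(dplyr)

# Hypothetical toy data: one row per point, months deliberately out of order
toy <- data.frame(
  Month = c("Mar", "Jan", "Jan", "Feb", "Feb", "Dec"),
  id    = c(24L, 27L, 48L, 34L, 34L, 21L)
)

by_month <- toy %>%
  mutate(Month = factor(Month, levels = month.abb)) %>%  # Jan, Feb, ..., Dec
  group_by(Month) %>%
  summarise(n = n_distinct(id))                          # unique vessel ids per month

by_month
```

Months that do not occur in the data are dropped from the result, since `group_by()` only keeps groups that are present by default.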

Or another option is to get the distinct rows and use count:

df %>%
      distinct(Month, Type) %>%
      count(Month)
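The same distinct-then-count idea can probably be extended to the layout the question asks for (one row per Month and Type, counting unique vessels) by keeping id in the `distinct()` step. The sketch below uses a hypothetical `toy` data frame with made-up values rather than the question's df.

```r
library(dplyr)

# Hypothetical toy data with the same columns as the question's df
toy <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Type  = c("Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Sailing"),
  id    = c(27L, 27L, 48L, 34L, 31L, 31L, 6L)
)

vessels <- toy %>%
  distinct(Month, Type, id) %>%  # one row per vessel per month and type
  count(Month, Type)             # rows left per group = unique vessels

vessels
```

Because `distinct()` has already removed repeated points for the same vessel, the `n` column from `count()` is the number of unique ids in each Month and Type combination.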

Or with data.table

library(data.table)
setDT(df)[, .(n = uniqueN(Type)), Month]
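Note that `setDT()` converts df to a data.table by reference, so df itself is changed. If that is not wanted, a sketch along these lines works on a copy instead, and groups by both Month and Type to count unique ids. The `toy` data frame and its values are hypothetical.

```r
library(data.table)

# Hypothetical toy data with the same columns as the question's df
toy <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Type  = c("Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Sailing"),
  id    = c(27L, 27L, 48L, 34L, 31L, 31L, 6L)
)

dt <- as.data.table(toy)  # makes a copy; toy stays a plain data.frame

# Unique vessel ids per Month and Type; keyby also sorts the result by the groups
res <- dt[, .(n = uniqueN(id)), keyby = .(Month, Type)]

res
```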

Or with base R

aggregate(Type ~ Month, unique(df[c('Type', 'Month')]), length)
aggregate(id ~ Month, unique(df[c('id', 'Month')]), length)

Regarding efficiency: base R, and aggregate in particular, is likely to be slow on larger data, as mentioned here

Upvotes: 1
