hmmmbob
hmmmbob

Reputation: 1177

dplyr, summarise categorical variable

I want to summarise my data small for each different video.id using dplyr.

small %>% 
  group_by(Video.ID) %>% 
  summarise(sumr = sum(Partner.Revenue),
            len = mean(Video.Duration..sec.),
            cat = mean(Category))

mean(Category) is clearly the wrong approach. How do I get it just to use the value that is repeated several times (one video.id has always the same category no matter how often it appears in the dataframe).

My dataframe looks like this :

small

# A tibble: 6 x 7
     X1  X1_1 Video.ID    Video.Duration..sec. Category Owned.Views Partner.Revenue
  <int> <int> <chr>                      <int> <chr>          <int>           <dbl>
1     1     1 ---0zh9uzSE                 1184 gadgets            6               0
2     2     2 ---0zh9uzSE                 1184 gadgets            6               0
3     3     3 ---0zh9uzSE                 1184 gadgets            2               0
4     4     4 ---0zh9uzSE                 1184 gadgets            1               0
5     5     5 ---0zh9uzSE                 1184 gadgets            1               0
6     6     6 ---0zh9uzSE                 1184 gadgets            3               0

small <- 
  structure(list(X1 = 1:6, 
                 X1_1 = 1:6, 
                 Video.ID = c("---0zh9uzSE", "---0zh9uzSE", "---0zh9uzSE", "---0zh9uzSE", "---0zh9uzSE", "---0zh9uzSE"), 
                 Video.Duration..sec. = c(1184L, 1184L, 1184L, 1184L, 1184L, 1184L), 
                 Category = c("gadgets", "gadgets", "gadgets", "gadgets", "gadgets", "gadgets"), 
                 Owned.Views = c(6L, 6L, 2L, 1L, 1L, 3L), 
                 Partner.Revenue = c(0, 0, 0, 0, 0, 0)), 
            row.names = c(NA, -6L), 
            class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 5

Views: 9155

Answers (2)

kangaroo_cliff
kangaroo_cliff

Reputation: 6222

Since it is a unique category for each video_id, you can have cat = Category[1], as in

small %>% group_by(Video.ID) %>% 
      summarise(sumr=sum(Partner.Revenue), len = mean(Video.Duration..sec.), 
      cat = Category[1])

# A tibble: 1 x 4
#  Video.ID     sumr   len cat    
#  <chr>       <dbl> <dbl> <chr>  
#  1 ---0zh9uzSE     0  1184 gadgets

Upvotes: 1

kath
kath

Reputation: 7724

You have at least two options to solve this:

Add the Category column to your group_by:

small %>% 
  group_by(Video.ID, cat = Category) %>% 
  summarise(sumr = sum(Partner.Revenue),
            len = mean(Video.Duration..sec.))

# A tibble: 1 x 4
# Groups:   Video.ID [?]
#     Video.ID    cat      sumr   len
#     <chr>       <chr>   <dbl> <dbl>
#   1 ---0zh9uzSE gadgets     0  1184

Or use unique(Catregory):

small %>% 
  group_by(Video.ID) %>% 
  summarise(sumr = sum(Partner.Revenue),
            len = mean(Video.Duration..sec.),
            cat = unique(Category))

# A tibble: 1 x 4
#   Video.ID     sumr   len cat    
#   <chr>       <dbl> <dbl> <chr>  
# 1 ---0zh9uzSE     0  1184 gadgets

The first option, might be perferred, because it still works if you have multiple categories per id.

Upvotes: 6

Related Questions