RandomThinker
RandomThinker

Reputation: 391

R Dataframe Regroup and Summarize

I have a dataframe looks like below:

Year   Person   Office
2005   Peter    Boston 
2007   Peter    Boston
2008   Peter    Chicago 
2009   Peter    New York
2011   Peter    New York 
2003   Amy      Seattle 
2004   Amy      Boston 
2006   Amy      Chicago 
2007   Amy      Chicago

I want to calculate a office-person level normalized measure (count) that captures the number of offices that a person experienced before coming to the current one. The measure is normalized by the total amount of years before arriving at the current position. Below is the ideal output. For Peter, Boston is his first office and thus, his normalized measure count for Boston is 0. For Peter, Chicago is his second office and he spent 2008-2005=3 years before coming to the Chicago office. Thus, his normalized measure count for Chicago is 1/3.

Office    Person  Count
Boston    Peter   0
Boston    Amy     1
Chicago   Peter   1/3
Chicago   Amy     2/3
New York  Peter   1/2
Seattle   Amy     0

Upvotes: 3

Views: 81

Answers (2)

Mr.Rlover
Mr.Rlover

Reputation: 2623

library(tidyverse)

cities %>% 
  group_by(Person, Office) %>%
  filter(row_number() == 1) %>% 
  group_by(Person) %>% 
  mutate(x = row_number()-1, y = (Year - Year[1])) %>% 
  mutate(count = ifelse(is.nan(x / y), x, x/y))

#   Year Person Office         x     y  test
#  <int> <chr>  <chr>      <dbl> <int> <dbl>
# 1  2005 Peter  "Boston"       0     0 0    
# 2  2008 Peter  "Chicago"      1     3 0.333
# 3  2009 Peter  "New York"     2     4 0.5  
# 4  2003 Amy    "Seattle "     0     0 0    
# 5  2004 Amy    "Boston"       1     1 1    
# 6  2006 Amy    "Chicago"      2     3 0.667

If you want the count represented as a fraction, we can use a helper function from the package pracma to reduce the fractions

cities %>% 
  group_by(Person, Office) %>%
  filter(row_number() == 1) %>% 
  group_by(Person) %>% 
  mutate(x = row_number()-1, y = (Year - Year[1])) %>% 
  mutate(count = ifelse(is.nan(x / y), x, x/y)) %>% 
  mutate(frac = ifelse(x == 0,
                       0,
                       ifelse(x/y == 1, 1,
                              paste0(x / pracma::gcd(x,y), "/", y / pracma::gcd(x,y)))
                       )
  ) %>% 
  select(-x, -y)

#   Year Person Office     count frac 
#  <int> <chr>  <chr>      <dbl> <chr>
# 1  2005 Peter  "Boston"   0     0    
# 2  2008 Peter  "Chicago"  0.333 1/3  
# 3  2009 Peter  "New York" 0.5   1/2  
# 4  2003 Amy    "Seattle " 0     0    
# 5  2004 Amy    "Boston"   1     1    
# 6  2006 Amy    "Chicago"  0.667 2/3 

data:

cities <- read.delim(text = "Year,Person,Office
2005,Peter,Boston
2007,Peter,Boston
2008,Peter,Chicago
2009,Peter,New York
2011,Peter,New York
2003,Amy,Seattle 
2004,Amy,Boston
2006,Amy,Chicago
2007,Amy,Chicago", sep = ",")

Upvotes: 1

Martin Gal
Martin Gal

Reputation: 16998

You could use

library(dplyr)

df %>% 
  group_by(Person, Office) %>% 
  slice_min(Year) %>% 
  arrange(Year) %>% 
  add_count() %>% 
  group_by(Person) %>% 
  mutate(Count = if_else(cumsum(n) == 1, 0, (cumsum(n) - 1) / (Year - first(Year))),
         .keep = "unused") %>% 
  ungroup()

This returns

# A tibble: 6 x 3
  Person Office   Count
  <chr>  <chr>    <dbl>
1 Amy    Seattle  0    
2 Amy    Boston   1    
3 Peter  Boston   0    
4 Amy    Chicago  0.667
5 Peter  Chicago  0.333
6 Peter  New_York 0.5 

Upvotes: 3

Related Questions