Reputation: 67
I am trying to find a way to create a column in my dataframe that will list out occurrences of each unique combination of personID and fiscal year.
I have a dataframe set up with variables like so:
Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017
So in this example I want to create an additional column in the df above that has something like 'year' which would list year 1 for both occurrences of id 250 and year 2017, but would have year 2 for id 250 and fiscal year 2018. Like so:
Person.Id Reported.Fiscal.Year year
250 2017 1
250 2017 1
250 2018 2
300 2018 1
511 2019 1
300 2018 1
700 2017 1
I've tried the following code:
df1 <- df1 %>% arrange(Person.Id,Reported.Fiscal.Year)
df2<- df1 %>% group_by(Person.Id,Reported.Fiscal.Year) %>% mutate(year=row_number())
But this results in a data frame that looks like this (essentially counting the occurrences of each year by ID):
Person.Id Reported.Fiscal.Year year
250 2017 1
250 2017 2
250 2018 1
300 2018 1
511 2019 1
300 2018 2
700 2017 1
Upvotes: 1
Views: 313
Reputation:
If I understand correctly, you want to enumarate the occurences of IDs accross the years?
I have used pieces of your code, you were close. Only you need to choose distinct rows to count the occurences with:
arrange()
both columns,group_by()
IDs to count fiscal years for each ID,distinct()
rows, i.e. unique combinations of ID and fiscal year,mutate()
with row_number()
as you did,See comments inside the code:
library(dplyr)
# your example data
df <- read.table(header = TRUE, text = "
Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017
")
# 1. arrange by ids and years (this is what you did)
# 2. group by ids to be able to count different fiscal years
# 3. choose only unique combinations of ids and fiscal years
# 4. use row numbers (as you did)
# 5. merge new column to original data
df %>%
arrange(Person.Id, Reported.Fiscal.Year) %>%
group_by(Person.Id) %>%
distinct() %>%
mutate(year = row_number()) %>%
inner_join(df, .)
#> Joining, by = c("Person.Id", "Reported.Fiscal.Year")
#> Person.Id Reported.Fiscal.Year year
#> 1 250 2017 1
#> 2 250 2017 1
#> 3 250 2018 2
#> 4 300 2018 1
#> 5 511 2019 1
#> 6 300 2018 1
#> 7 700 2017 1
Created on 2020-07-06 by the reprex package (v0.3.0)
Upvotes: 2
Reputation: 1428
Here's an alternative to @Petr & @Bruno's very nice join-based solutions. This one works by building a cumulative count of unique years for each person.
library(readr)
df <- read_table("Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017")
library(dplyr)
df %>%
arrange(Person.Id, Reported.Fiscal.Year) %>%
group_by(Person.Id) %>%
mutate(year = cumsum(!duplicated(Reported.Fiscal.Year)))
#> # A tibble: 7 x 3
#> # Groups: Person.Id [4]
#> Person.Id Reported.Fiscal.Year year
#> <dbl> <dbl> <int>
#> 1 250 2017 1
#> 2 250 2017 1
#> 3 250 2018 2
#> 4 300 2018 1
#> 5 300 2018 1
#> 6 511 2019 1
#> 7 700 2017 1
Created on 2020-07-06 by the reprex package (v0.3.0)
Upvotes: 2
Reputation: 4150
Welcome to SO!
Had to summarise your data before, maybe someone can provide a simpler solution
library(tidyverse)
df_example <- read_table("Person.Id Reported.Fiscal.Year
250 2017
250 2017
250 2018
300 2018
511 2019
300 2018
700 2017")
df_example_summary <- df_example %>%
group_by(Person.Id,Reported.Fiscal.Year) %>%
summarise(number_reports = n(),,.groups = "drop_last") %>%
mutate(Year = row_number()) %>%
ungroup()
df_example %>%
left_join(df_example_summary)
#> Joining, by = c("Person.Id", "Reported.Fiscal.Year")
#> # A tibble: 7 x 4
#> Person.Id Reported.Fiscal.Year number_reports Year
#> <dbl> <dbl> <int> <int>
#> 1 250 2017 2 1
#> 2 250 2017 2 1
#> 3 250 2018 1 2
#> 4 300 2018 2 1
#> 5 511 2019 1 1
#> 6 300 2018 2 1
#> 7 700 2017 1 1
Created on 2020-07-06 by the reprex package (v0.3.0)
Upvotes: 2