Reputation: 301
So this is the data I'm working with:
ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000
What I'm trying to do is create a table that shows the amount of value lost that is grouped by Year, State and Grade. That part I have done but the issue is you can see that there is a duplicated row for ID=1. I need to add a component to my code that removes any duplicated rows like it in my data once I have grouped the data by Year, State and Grade.
The reason I want to remove the duplicates after I have grouped the data is that the ID number may duplicate for a different year but that is OK as that is a new observation. I just want to remove any duplicates if the Year, State and Grade match. Basically if the whole row is a duplicate, it should be removed.
I can't tell if I should use Unique() or Distinct() but here is what I have so far:
Answer <- data %>%
group_by(Year, State, Grade) %>%
filter(row_number(ID) == 1) %>% #This is the part to replace
summarise(x = sum(Loss) / sum(Total)) %>%
spread(State, x)
The output should look like this:
Year State Grade x
2016 AZ A 0.05
2016 AZ B 0
2016 AZ C 0
2017 AZ A 0
2017 AZ B 0
2017 AZ C 0.1
Upvotes: 1
Views: 594
Reputation: 12074
A few things. Below, I use distinct
to remove duplicate rows. Also, in your expected results you have an entry for grade C for 2016, which isn't in your original data. So, I used complete
to add this (and any other missing cases) as a zero. Finally, as @akrun notes above: where does 0.00833
come from? Typo or have I misunderstood the calculation?
df <- read.table(text = "ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000", header = TRUE)
Answer <- df %>%
distinct %>%
group_by(Year, State, Grade) %>%
summarise(x = sum(Loss) / sum(Total)) %>%
complete(Year, State, Grade, fill = list(x = 0))
# # A tibble: 6 x 4
# # Groups: Year, State [2]
# Year State Grade x
# <int> <fct> <fct> <dbl>
# 1 2016 AZ A 0.05
# 2 2016 AZ B 0
# 3 2016 AZ C 0
# 4 2017 AZ A 0
# 5 2017 AZ B 0
# 6 2017 AZ C 0.1
Upvotes: 3