Kardinol
Kardinol

Reputation: 301

How can I filter out Duplicated Rows per Group

So this is the data I'm working with:

ID    Year    State    Grade    Loss    Total    
1     2016    AZ       A        50      1000
1     2016    AZ       A        50      1000
2     2016    AZ       B        0       5000
3     2017    AZ       A        0       2000
4     2017    AZ       C        10      100
2     2017    AZ       B        0       3000

What I'm trying to do is create a table that shows the amount of value lost that is grouped by Year, State and Grade. That part I have done but the issue is you can see that there is a duplicated row for ID=1. I need to add a component to my code that removes any duplicated rows like it in my data once I have grouped the data by Year, State and Grade.

The reason I want to remove the duplicates after I have grouped the data is that the ID number may duplicate for a different year but that is OK as that is a new observation. I just want to remove any duplicates if the Year, State and Grade match. Basically if the whole row is a duplicate, it should be removed.

I can't tell if I should use Unique() or Distinct() but here is what I have so far:

   Answer <- data %>%
        group_by(Year, State, Grade) %>%
        filter(row_number(ID) == 1) %>% #This is the part to replace
        summarise(x = sum(Loss) / sum(Total)) %>%
        spread(State, x)

The output should look like this:

Year    State    Grade    x
2016    AZ       A        0.05
2016    AZ       B        0
2016    AZ       C        0
2017    AZ       A        0
2017    AZ       B        0
2017    AZ       C        0.1

Upvotes: 1

Views: 594

Answers (1)

Dan
Dan

Reputation: 12074

A few things. Below, I use distinct to remove duplicate rows. Also, in your expected results you have an entry for grade C for 2016, which isn't in your original data. So, I used complete to add this (and any other missing cases) as a zero. Finally, as @akrun notes above: where does 0.00833 come from? Typo or have I misunderstood the calculation?

df <- read.table(text = "ID    Year    State    Grade    Loss    Total    
1     2016    AZ       A        50      1000
1     2016    AZ       A        50      1000
2     2016    AZ       B        0       5000
3     2017    AZ       A        0       2000
4     2017    AZ       C        10      100
2     2017    AZ       B        0       3000", header = TRUE)

Answer <- df %>%  
  distinct %>% 
  group_by(Year, State, Grade) %>%
  summarise(x = sum(Loss) / sum(Total)) %>%
  complete(Year, State, Grade, fill = list(x = 0)) 

# # A tibble: 6 x 4
# # Groups:   Year, State [2]
#    Year State Grade     x
#   <int> <fct> <fct> <dbl>
# 1  2016 AZ    A      0.05
# 2  2016 AZ    B      0   
# 3  2016 AZ    C      0   
# 4  2017 AZ    A      0   
# 5  2017 AZ    B      0   
# 6  2017 AZ    C      0.1 

Upvotes: 3

Related Questions