Lara
Lara

Reputation: 3

R - Sum rows from different column based on condition in grouped values

I have a large dataset of different sites inside and out of Natura 2000 network.

An example:

df

Name Total_Surface N2000 SurfaceN2000
A 1 Yes 0.5
B 5 No NA
C 11 No NA
D 10 Yes 5

I want to have the sum of surfaces grouped by the N2000 factor (one total surface for "Yes", another for "No"). However, if the site is inside the N2000 network (N2000=Yes), I want to take the data from the column "SurfaceN2000".

How can I make a sum that depending on the N2000 factor it takes the values from Total_Surface or SurfaceN2000?

I know I could create a column with the sum of Total_Surface and another one with SurfaceN2000 but I was wondering if there was a more automatic solution.

I tried a number of things but they didn't work. The last thing I tried was

df %>% 
  group_by(N2000) %>%
  summarise(surface=
             case_when(N2000== "No" ~ sum(Total_Surface,na.rm = TRUE), 
                       N2000== "Yes" ~ sum(SurfaceN2000,na.rm = TRUE)))

I also tried with filter() or rbind() but it didn't work either.

I was expecting an output table like:

N2000 Surface
Yes 5.5
No 16

Thank you very much in advance! :)

Upvotes: 0

Views: 98

Answers (2)

Friede
Friede

Reputation: 7979

Consider pivoting your data (according to a more sophisticated rule than demonstatred here).

tidyr::pivot_longer(d0, cols = c(Total_Surface, SurfaceN2000)) |>
  dplyr::summarise(Surface = sum(value, na.rm = TRUE), .by = (c(N2000, name)))
# A tibble: 4 × 3
  N2000 name          Surface
  <chr> <chr>           <dbl>
1 Yes   Total_Surface    11  
2 Yes   SurfaceN2000      5.5
3 No    Total_Surface    16  
4 No    SurfaceN2000      0  

You could add a filter to return the desired subset.


Data.

> dput(d0)
structure(list(Name = c("A", "B", "C", "D"), Total_Surface = c(1L, 
5L, 11L, 10L), N2000 = c("Yes", "No", "No", "Yes"), SurfaceN2000 = c(0.5, 
NA, NA, 5)), class = "data.frame", row.names = c(NA, -4L))

Upvotes: 0

Luigi
Luigi

Reputation: 411

Something like this?

library(dplyr)

df %>% 
  mutate(area = ifelse(N2000 == "Yes", SurfaceN2000, Total_Surface)) %>%
  summarise(surface = sum(area, na.rm = TRUE), .by = N2000)

# A tibble: 2 × 2
  N2000 surface
  <chr>   <dbl>
1 No       16  
2 Yes       5.5

Upvotes: 0

Related Questions