Reputation: 1235
I have a dataset on Coivd-19 cases and deaths by day and country. I wish to find the date when the first death occured for every country, and the filter away all the preceding days. How would you tackle this problem in R/Tidyverse?
library(readxl)
library(httr)
url <- paste("https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-",format(Sys.time(), "%Y-%m-%d"), ".xlsx", sep = "")
GET(url, authenticate(":", ":", type="ntlm"), write_disk(tf <- tempfile(fileext = ".xlsx")))
df <- read_excel(tf)
Upvotes: 1
Views: 92
Reputation: 11
library(dplyr)
df <- df %>%
arrange(DateRep) %>%
distinct(`Countries and territories`, .keep_all = TRUE)
I think this is the simplest option for you. dplyr::distinct()
removes all but the first row, so if you arrange the data by the desired column you'll get the first match only. .keep_all = TRUE
will keep all columns, not just those specified in distinct()
.
Upvotes: 0
Reputation: 1618
Try this:
library(dplyr)
# or library(tidyverse)
df %>%
arrange(`Countries and territories`, DateRep) %>%
group_by(`Countries and territories`) %>%
mutate(Cumulative_Death = cumsum(Deaths)) %>%
ungroup() %>%
filter(Cumulative_Death > 0) %>%
group_by(`Countries and territories`) %>%
mutate(First_Death_Date = min(DateRep))
It adds new columns Cumulative_Death
: sum of death up-to DateRep
and First_Death_Date
: date when the first death occured for every country
Upvotes: 1