Mads Lumholdt

Reputation: 25

Subset dataset based on dates in entire dataset

I am working with a dataset in R. The dataset has MANY date variables, and I want to subset the data frame to the rows with dates between 2023-01-01 and 2023-12-31.

I know I can use the command based on one variable:

df2023 <- df[df$"variable.name" >= "2023-01-01" & df$"variable.name" <= "2023-12-31", ]

But I need to somehow be able to "screen" all columns. If 2023 appears in any column the entire row/record should be included. Is this possible?

The dataset contains numerical, character, categorical and date variables.

Thanks.

Upvotes: 0

Views: 88

Answers (3)

Jon Spring

Reputation: 66880

I'm presuming you have column names that represent dates, and you want to do comparison tests on them. (Those names aren't syntactic, so you could encounter more problems down the road. But in this case we might be ok for now.)

We can do a test on names(df) that outputs a logical vector, which we can feed into df[LOGICAL_VECTOR] to get the subset of columns which match the test.

df <- data.frame(`2023-01-01` = LETTERS[1:10],
           `2023-02-01` = 1:10,
           `2023-03-01` = letters[1:10],
           `2023-04-01` = 11:20,
           check.names = FALSE)

> names(df)
[1] "2023-01-01" "2023-02-01" "2023-03-01" "2023-04-01"
> names(df) >= "2023-03-01"
[1] FALSE FALSE  TRUE  TRUE

df[names(df) >= "2023-03-01"]

Result

   2023-03-01 2023-04-01
1           a         11
2           b         12
3           c         13
4           d         14
5           e         15
6           f         16
7           g         17
8           h         18
9           i         19
10          j         20
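
The same name test extends to a date range. Here is a minimal sketch, assuming the column names are ISO-formatted dates (yyyy-mm-dd) so that string comparison follows chronological order. The data and the name `in_2023` are made up for illustration:

```r
# Hypothetical data: column names are ISO dates, two of them outside 2023
df <- data.frame(`2022-12-01` = 1:3,
                 `2023-02-01` = 4:6,
                 `2023-11-01` = letters[1:3],
                 `2024-01-01` = 7:9,
                 check.names = FALSE)

# Logical vector: TRUE for names that fall inside 2023
in_2023 <- names(df) >= "2023-01-01" & names(df) <= "2023-12-31"

# Keep only the matching columns
df_2023 <- df[in_2023]
names(df_2023)
# [1] "2023-02-01" "2023-11-01"
```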

Upvotes: 1

G. Grothendieck

Reputation: 270045

Questions to SO should include test data. Since you are relatively new we will provide that using data shown in the Note at the end.

1) For an example we will extract all rows having 2021 in any date column. Compare the year to 2021 and keep the rows for which any of those comparisons are TRUE. Note that the -1 in across means all columns except the first.

library(dplyr)
library(lubridate) # year

dat %>%
  rowwise %>%
  filter(any(across(-1, year) == 2021, na.rm = TRUE) ) %>%
  ungroup

## # A tibble: 2 × 3
##     SSN date_today date_adm  
##   <dbl> <chr>      <chr>     
## 1   101 2021-07-09 <NA>      
## 2   666 1914-01-01 2021-04-07

2) For a base R approach, first create a function that converts a character string to a year. Then, for each row, check whether any of the resulting years equals 2021 and hand that logical vector to subset. Note that we use [-1] to exclude the first column since it is not a date column.

yr <- function(x) as.numeric(substr(x, 1, 4))
subset(dat, apply(dat[-1], 1, \(x) any(yr(x) == 2021, na.rm = TRUE)))

##   SSN date_today   date_adm
## 3 101 2021-07-09       <NA>
## 4 666 1914-01-01 2021-04-07
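
For the range in the question (2023-01-01 to 2023-12-31), the same idea can be sketched in base R without listing the date columns by hand. This assumes the date columns are already of class Date (character columns would need as.Date first). The data frame and the names `is_date`, `in_range` and `hit` are illustrative, not taken from the question:

```r
# Hypothetical mixed-type data with two Date columns
df <- data.frame(
  id    = 1:4,
  group = c("a", "b", "a", "c"),
  adm   = as.Date(c("2022-05-01", "2023-03-15", NA, "2024-01-02")),
  disch = as.Date(c("2022-06-01", "2024-01-10", "2023-12-31", NA))
)

# Find the Date columns automatically
is_date <- vapply(df, inherits, logical(1), what = "Date")

# For each Date column, flag values inside the range (NA counts as no match)
lo <- as.Date("2023-01-01")
hi <- as.Date("2023-12-31")
in_range <- vapply(df[is_date],
                   function(x) !is.na(x) & x >= lo & x <= hi,
                   logical(nrow(df)))

# Keep rows where at least one Date column matched
hit <- rowSums(in_range) > 0
df2023 <- df[hit, ]
df2023$id
# [1] 2 3
```

Non-date columns are skipped entirely, so numeric or character values that happen to contain "2023" do not trigger a match.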

Note

This is from this link, except we have used NA in place of "NA".

dat <- data.frame(
 SSN = c(204,401,101,666,777),
 date_today = c("1914-01-01","2022-03-12","2021-07-09","1914-01-01","2022-04-05"),
 date_adm = c("2020-03-11","2022-03-12",NA,"2021-04-07","2022-04-05"))

dat
##   SSN date_today   date_adm
## 1 204 1914-01-01 2020-03-11
## 2 401 2022-03-12 2022-03-12
## 3 101 2021-07-09       <NA>
## 4 666 1914-01-01 2021-04-07
## 5 777 2022-04-05 2022-04-05

Upvotes: 2

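In your example, you are comparing a string ("2023-01-01") with the values stored in "variable.name". Strings are compared character by character, which only matches chronological order when the values are stored in yyyy-mm-dd format, so it is safer to compare Date objects. You can probably do the following: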

df$variable.name = as.Date(df$variable.name) #make sure the variable is date object

df2023 = subset(df,df$variable.name >= as.Date("2023-01-01") & df$variable.name <= as.Date("2023-12-31"))
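
To see why the conversion matters, here is a small sketch with made-up values. It assumes the dates are stored as day/month/year text, a format in which plain string comparison does not follow chronological order. The vector `x` and the format string are illustrative only:

```r
# Hypothetical dates stored as dd/mm/yyyy text
x <- c("15/03/2023", "01/01/2024", "31/12/2022")

# Comparing the raw strings goes character by character, so it is not chronological
raw_cmp <- x >= "01/01/2023"

# Convert to Date, stating the format explicitly, then compare
d <- as.Date(x, format = "%d/%m/%Y")
keep <- d >= as.Date("2023-01-01") & d <= as.Date("2023-12-31")
x[keep]
# [1] "15/03/2023"
```

If the column is already in yyyy-mm-dd format, as.Date needs no format argument.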

Upvotes: 0
