Tariq Nisar
Tariq Nisar

Reputation: 35

subsetting data with multiple conditions using dplyr

I have a dataset with columns DX1 to DX10 with character variables coded with iCD9codes for example I would like to filter my data set with the codes 4289, 4281, 4282 that lies anywhere from DX1 to DX10

Any easier way to do that using dplyr?

enter image description here

Upvotes: 0

Views: 1216

Answers (3)

Tariq Nisar
Tariq Nisar

Reputation: 35

z<-Y %>% 
filter(DX1%in% c("4289", "4281","4282") | DX2%in% c("4289", "4281", "4283") | DX3 %in% c("4289", "4281", "4283") | DX4 %in% c("4289", "4281", "4283")| DX5%in% c("4289", "4281", "4283")| DX6 %in% c("4289", "4281", "4283") | DX7%in% c("4289", "4281", "4283") | DX8%in% c("4289", "4281", "4283")| DX9%in% c("4289", "4281", "4283")| DX10%in% c("4289", "4281", "4283"))

Upvotes: 0

Nik Muhammad Naim
Nik Muhammad Naim

Reputation: 578

Assuming you want to retain all the columns from your original dataset, there are 2 ways to go around this:

1. Use filter_all to filter all columns. This is assuming that your dataset contains only from DX1 - DX10

Input:

set.seed(10)
df <- tibble(DX1 = sample(4280:4380, 10),
             DX2 = c(4281, sample(4200:4500, 9)),
             DX3 = sample(4270:4310, 10),
             DX4 = sample(4280:4295, 10))

df
# A tibble: 10 x 4
     DX1   DX2   DX3   DX4
   <int> <dbl> <int> <int>
 1  4331 4281.  4304  4285
 2  4310 4396.  4310  4288
 3  4322 4370.  4293  4281
 4  4347 4233.  4299  4282
 5  4288 4377.  4283  4290
 6  4301 4306.  4284  4284
 7  4306 4326.  4294  4287
 8  4305 4215.  4298  4289
 9  4337 4277.  4277  4294
10  4319 4316.  4309  4291

Method:

df %>% filter_all(any_vars(. == 4282 | . == 4281 | . == 4289))

Output:

# A tibble: 4 x 4
    DX1   DX2   DX3   DX4
  <int> <dbl> <int> <int>
1  4331 4281.  4304  4285
2  4322 4370.  4293  4281
3  4347 4233.  4299  4282
4  4305 4215.  4298  4289


2. Use filter_at to filter certain columns. This is assuming that your dataset also contains other columns besides DX1 - DX10.

Input:

set.seed(124)
df2 <- tibble(DX1 = sample(4280:4380, 10),
              DX2 = c(4281, sample(4200:4500, 9)),
              DX3 = sample(4250:4350, 10),
              DX4 = sample(4280:4300, 10),
              AA1 = sample(4280:4295, 10),
              AA2 = c(sample(4500:5500, 9), 4289))

df2
# A tibble: 10 x 6
     DX1   DX2   DX3   DX4   AA1   AA2
   <int> <dbl> <int> <int> <int> <dbl>
 1  4288 4281.  4253  4288  4285 4649.
 2  4320 4432.  4312  4296  4282 4920.
 3  4331 4457.  4309  4285  4291 5335.
 4  4318 4426.  4257  4287  4288 5020.
 5  4301 4453.  4290  4294  4294 4717.
 6  4308 4321.  4282  4281  4284 4746.
 7  4335 4216.  4269  4299  4283 4864.
 8  4326 4370.  4328  4282  4292 4731.
 9  4365 4419.  4274  4292  4281 5239.
10  4305 4459.  4326  4295  4289 4289.

Method:

df2 %>% filter_at(vars(starts_with("DX")), any_vars(. == 4282 | . == 4281 | . == 4289))

Output:

# A tibble: 3 x 6
     DX1   DX2   DX3   DX4   AA1   AA2
   <int> <dbl> <int> <int> <int> <dbl>
 1  4288 4281.  4253  4288  4285 4649.
 2  4308 4321.  4282  4281  4284 4746.
 3  4326 4370.  4328  4282  4292 4731.

Upvotes: 1

Calum You
Calum You

Reputation: 15062

x <- Y %>%
    gather(key = "DX", value = "code", starts_with("DX")) %>%
    filter(DX %in% c("4289", "4281", "4282"))

As Frank noted in comments, dplyr works best when data is tidy, meaning variables are in columns and observations are in rows. Use tidyr::gather to combine the data into two columns, one key column named DX that tells you what column that value came from, and a value column, here called code, that is the number in the cell. Then it is easy to use filter because there is only one column you need, the new DX column.

See this chapter on tidy data for more info.

Upvotes: 0

Related Questions