Reputation: 35
I have a dataset with columns DX1 to DX10 with character variables coded with iCD9codes for example I would like to filter my data set with the codes 4289, 4281, 4282 that lies anywhere from DX1 to DX10
Any easier way to do that using dplyr?
Upvotes: 0
Views: 1216
Reputation: 35
z<-Y %>%
filter(DX1%in% c("4289", "4281","4282") | DX2%in% c("4289", "4281", "4283") | DX3 %in% c("4289", "4281", "4283") | DX4 %in% c("4289", "4281", "4283")| DX5%in% c("4289", "4281", "4283")| DX6 %in% c("4289", "4281", "4283") | DX7%in% c("4289", "4281", "4283") | DX8%in% c("4289", "4281", "4283")| DX9%in% c("4289", "4281", "4283")| DX10%in% c("4289", "4281", "4283"))
Upvotes: 0
Reputation: 578
Assuming you want to retain all the columns from your original dataset, there are 2 ways to go around this:
1. Use filter_all
to filter all columns. This is assuming that your dataset contains only from DX1 - DX10
Input:
set.seed(10)
df <- tibble(DX1 = sample(4280:4380, 10),
DX2 = c(4281, sample(4200:4500, 9)),
DX3 = sample(4270:4310, 10),
DX4 = sample(4280:4295, 10))
df
# A tibble: 10 x 4
DX1 DX2 DX3 DX4
<int> <dbl> <int> <int>
1 4331 4281. 4304 4285
2 4310 4396. 4310 4288
3 4322 4370. 4293 4281
4 4347 4233. 4299 4282
5 4288 4377. 4283 4290
6 4301 4306. 4284 4284
7 4306 4326. 4294 4287
8 4305 4215. 4298 4289
9 4337 4277. 4277 4294
10 4319 4316. 4309 4291
Method:
df %>% filter_all(any_vars(. == 4282 | . == 4281 | . == 4289))
Output:
# A tibble: 4 x 4
DX1 DX2 DX3 DX4
<int> <dbl> <int> <int>
1 4331 4281. 4304 4285
2 4322 4370. 4293 4281
3 4347 4233. 4299 4282
4 4305 4215. 4298 4289
2. Use filter_at
to filter certain columns. This is assuming that your dataset also contains other columns besides DX1 - DX10.
Input:
set.seed(124)
df2 <- tibble(DX1 = sample(4280:4380, 10),
DX2 = c(4281, sample(4200:4500, 9)),
DX3 = sample(4250:4350, 10),
DX4 = sample(4280:4300, 10),
AA1 = sample(4280:4295, 10),
AA2 = c(sample(4500:5500, 9), 4289))
df2
# A tibble: 10 x 6
DX1 DX2 DX3 DX4 AA1 AA2
<int> <dbl> <int> <int> <int> <dbl>
1 4288 4281. 4253 4288 4285 4649.
2 4320 4432. 4312 4296 4282 4920.
3 4331 4457. 4309 4285 4291 5335.
4 4318 4426. 4257 4287 4288 5020.
5 4301 4453. 4290 4294 4294 4717.
6 4308 4321. 4282 4281 4284 4746.
7 4335 4216. 4269 4299 4283 4864.
8 4326 4370. 4328 4282 4292 4731.
9 4365 4419. 4274 4292 4281 5239.
10 4305 4459. 4326 4295 4289 4289.
Method:
df2 %>% filter_at(vars(starts_with("DX")), any_vars(. == 4282 | . == 4281 | . == 4289))
Output:
# A tibble: 3 x 6
DX1 DX2 DX3 DX4 AA1 AA2
<int> <dbl> <int> <int> <int> <dbl>
1 4288 4281. 4253 4288 4285 4649.
2 4308 4321. 4282 4281 4284 4746.
3 4326 4370. 4328 4282 4292 4731.
Upvotes: 1
Reputation: 15062
x <- Y %>%
gather(key = "DX", value = "code", starts_with("DX")) %>%
filter(DX %in% c("4289", "4281", "4282"))
As Frank noted in comments, dplyr
works best when data is tidy, meaning variables are in columns and observations are in rows. Use tidyr::gather
to combine the data into two columns, one key
column named DX
that tells you what column that value came from, and a value
column, here called code
, that is the number in the cell.
Then it is easy to use filter
because there is only one column you need, the new DX
column.
See this chapter on tidy data for more info.
Upvotes: 0