How to subset dataframe based on multiple variables in R

Question

I have a dataframe of 286 columns and 157355 rows. I wish to subset rows that contain one or more of several defined factor variables such as F32, F341 etc. Once this has been completed, I wish to identify which other factor variables are most common in the subset rows.

I tried to filter for the values of interest, but an error message appears saying the data must be numeric, logical or complex. For example:

d<- a %>%
  filter_at(vars(f.41202.0.0:f.41202.0.65), all_vars('F32'))

I also tried this, but the resulting dataframe had no rows:

f <- a %>%
  rowwise() %>%
  filter(any(c(1:280) %in% c('F32', 'F320', 'F321', 'F322', 'F323', 
                             'F328', 'F329', 'F330', 'F331', 'F332', 
                             'F333', 'F334', 'F338', 'F339')))

The same happened when I stored all the relevant codes in an ICD object:

f <- b %>%
  rowwise() %>%
  filter(any(c(1:286) %in% ICD))

I would greatly appreciate any suggestions, thanks.

My data looks like this (sorry, I can't find a way to format it better on this page):

Row.name Var1 Var2 Var3 Var4
1        F3   NA   NA   M87
2        NA   NA   M87  NA
3        NA   F3   NA   K17
4        NA   NA   F3   M87

After subsetting the rows based on F3, it should look like this:

Row.name Var1 Var2 Var3 Var4
1        F3   NA   NA   M87
3        NA   F3   NA   K17
4        NA   NA   F3   M87

So the same variable columns are retained, but rows without F3 are removed.

Then I would like to list the other values (anything other than F3) by how common they are within that subset. In this case that would be:

most common: M87

2nd most common: K17

If it helps, I am trying to identify individuals with a particular disease, and then find out which other diseases those individuals most commonly have.

Thanks for the help.

Ben · Accepted Answer

If you wish to use tidyverse, you can use filter_all to test every column, and any_vars to keep a row when at least one column holds a value from a vector of diagnostic codes. In my example, I look for F3 and F320.

Afterwards, if you want to count how often each diagnosis code occurs, you could reshape your data from wide to long and then count frequencies. If you wish, you can drop the NA values with filter. Let me know if this is what you had in mind.

df <- data.frame(
  Var1 = c("F3", NA, NA, NA),
  Var2 = c(NA, NA, "F3", NA),
  Var3 = c(NA, "M87", NA, "F3"),
  Var4 = c("M87", NA, "K17", "M87")
)

library(tidyverse)

df %>%
  filter_all(any_vars(. %in% c("F3", "F320"))) %>%
  pivot_longer(cols = starts_with("Var"), names_to = "Var", values_to = "Code") %>%
  filter(!is.na(Code)) %>%
  count(Code, sort = TRUE)

After the filter, you should have:

  Var1 Var2 Var3 Var4
1   F3 <NA> <NA>  M87
2 <NA>   F3 <NA>  K17
3 <NA> <NA>   F3  M87

After pivot_longer and count:

# A tibble: 3 x 2
  Code      n
  <chr> <int>
1 F3        3
2 M87       2
3 K17       1
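
The count above still includes the code used for filtering (F3). Since the question asks for the other codes, one possible variation is to drop the target codes before counting. This is only a sketch: target_codes and other_counts are names made up for illustration, and stringsAsFactors = FALSE is added so the example behaves the same on R versions before 4.0.

```r
library(dplyr)
library(tidyr)

# Same example data as above
df <- data.frame(
  Var1 = c("F3", NA, NA, NA),
  Var2 = c(NA, NA, "F3", NA),
  Var3 = c(NA, "M87", NA, "F3"),
  Var4 = c("M87", NA, "K17", "M87"),
  stringsAsFactors = FALSE
)

# Hypothetical name: the codes that define the subset of interest
target_codes <- c("F3", "F320")

other_counts <- df %>%
  # keep rows where at least one column holds a target code
  filter_all(any_vars(. %in% target_codes)) %>%
  # wide to long: one row per (column, code) pair
  pivot_longer(cols = starts_with("Var"), names_to = "Var", values_to = "Code") %>%
  # drop missing values and the target codes themselves
  filter(!is.na(Code), !Code %in% target_codes) %>%
  # frequency of each remaining code, most common first
  count(Code, sort = TRUE)

print(other_counts)
```

With the example data this leaves M87 (twice) ahead of K17 (once), which matches the ranking asked for in the question.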

Side note: if you wish to filter based on only some of your variables (instead of selecting all variables), you can use filter_at instead, such as:

filter_at(vars(starts_with("Var")), any_vars(. %in% c("F3", "F320")))
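
In newer dplyr releases (1.0.4 and later), filter_all and filter_at are marked as superseded in favour of if_any and if_all inside a plain filter. A rough equivalent of the filter_at call above, assuming a recent dplyr is installed, might look like the following; codes and subset_rows are illustrative names only.

```r
library(dplyr)

# Same example data as above
df <- data.frame(
  Var1 = c("F3", NA, NA, NA),
  Var2 = c(NA, NA, "F3", NA),
  Var3 = c(NA, "M87", NA, "F3"),
  Var4 = c("M87", NA, "K17", "M87"),
  stringsAsFactors = FALSE
)

# Hypothetical name for the vector of diagnostic codes
codes <- c("F3", "F320")

# Keep rows where any of the selected columns contains one of the codes
subset_rows <- df %>%
  filter(if_any(starts_with("Var"), ~ .x %in% codes))

print(subset_rows)
```

Replacing starts_with("Var") with everything() would check all columns, much like filter_all does.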
